Abstract

Directed information or its variants are utilized extensively in the characterization of the capacity of 
channels with memory and feedback, nonanticipative lossy data compression, and their generalizations 
to networks. 

In this paper, we derive several functional and topological properties of directed information for 
general abstract alphabets (complete separable metric spaces) using the topology of weak convergence 
of probability measures. These include convexity of the set of causally conditioned convolutional 
distributions, convexity and concavity of directed information with respect to sets of such distributions, 
weak compactness of families of causally conditioned convolutional distributions, their joint distributions 
and their marginals. Furthermore, we show lower semicontinuity of directed information, and under 
certain conditions we also establish continuity of directed information. Finally, we derive variational 
equalities of directed information, which are analogous to those of mutual information (utilized in
the Blahut-Arimoto algorithm).

In summary, we extend the basic functional and topological properties of mutual information to 
directed information. Throughout the paper, the importance of the properties of directed information is 
discussed in the context of extremum problems of directed information. 
I. Introduction

Directed information quantifies the directivity of information defined by a causal sequence
of feedback and feedforward channel conditional distributions [1], [2]. Specifically, given two
sequences of Random Variables (RV's) $X^n = \{X_0, X_1, \ldots, X_n\} \in \mathcal{X}_{0,n} = \times_{i=0}^{n} \mathcal{X}_i$,
$Y^n = \{Y_0, Y_1, \ldots, Y_n\} \in \mathcal{Y}_{0,n} = \times_{i=0}^{n} \mathcal{Y}_i$, where $\mathcal{X}_i$ and $\mathcal{Y}_i$ are the input and
output alphabets of a channel, respectively, directed information from $X^n$ to $Y^n$ is often defined
via conditional mutual information [2], [3]:

$$I(X^n \to Y^n) = \sum_{i=0}^{n} I(X^i; Y_i | Y^{i-1}) \tag{I.1}$$

$$= \sum_{i=0}^{n} \int \log \left( \frac{P_{Y_i|Y^{i-1},X^i}(dy_i|y^{i-1},x^i)}{P_{Y_i|Y^{i-1}}(dy_i|y^{i-1})} \right) P_{X^i,Y^i}(dx^i,dy^i) \tag{I.2}$$

$$= I_{X^n \to Y^n}(P_{X_i|X^{i-1},Y^{i-1}}, P_{Y_i|Y^{i-1},X^i} : i = 0, 1, \ldots, n). \tag{I.3}$$

For each $i = 0, \ldots, n$, $P_{Y_i|Y^{i-1},X^i}(\cdot|\cdot,\cdot)$ is the conditional distribution of the RV $Y_i$ given the
causal information $Y^{i-1} = y^{i-1}$, $X^i = x^i$, $P_{Y_i|Y^{i-1}}(\cdot|\cdot)$ is similarly defined, and $P_{X^i,Y^i}(\cdot,\cdot)$
is the joint distribution of the RV's $(X^i, Y^i)$. Using the decomposition $P_{X^i,Y^i}(dx^i,dy^i) =
\otimes_{j=0}^{i} P_{X_j|X^{j-1},Y^{j-1}} \otimes P_{Y_j|Y^{j-1},X^j}$-a.s., it is clear that directed information $I(X^n \to Y^n)$ is a
functional of two collections of causal conditional distributions, the feedforward distribution
$\{P_{Y_i|Y^{i-1},X^i}(\cdot|\cdot,\cdot) : i = 0,1,\ldots,n\}$, and the feedback distribution $\{P_{X_i|X^{i-1},Y^{i-1}}(\cdot|\cdot,\cdot) : i =
0,1,\ldots,n\}$, hence the adopted functional notation indicated by (I.3).
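
To make definition (I.1)-(I.2) concrete, the following is a minimal Python sketch for the special case of finite alphabets, where the integrals reduce to sums. It is an illustration under stated assumptions, not the construction used in the paper: the joint distribution is assumed to be given as a dictionary mapping `(x_tuple, y_tuple)` to a probability, and the names `marginal`, `directed_information`, `bsc` and `EPS` are hypothetical. Two joint distributions over a binary symmetric channel are built, one with i.i.d. uniform inputs and one in which the second input copies the first output (feedback).

```python
from collections import defaultdict
from itertools import product
from math import log


def marginal(joint, nx, ny):
    """Marginal pmf of (x_0..x_{nx-1}, y_0..y_{ny-1}) from a joint pmf on (x^n, y^n)."""
    out = defaultdict(float)
    for (xs, ys), prob in joint.items():
        out[(xs[:nx], ys[:ny])] += prob
    return out


def directed_information(joint):
    """I(X^n -> Y^n) in nats, computed as sum_i I(X^i; Y_i | Y^{i-1})."""
    horizon = len(next(iter(joint))[0])  # number of time steps, n + 1
    total = 0.0
    for i in range(horizon):
        p_xy = marginal(joint, i + 1, i + 1)   # P(x^i, y^i)
        p_x_yprev = marginal(joint, i + 1, i)  # P(x^i, y^{i-1})
        p_y = marginal(joint, 0, i + 1)        # P(y^i)
        p_yprev = marginal(joint, 0, i)        # P(y^{i-1})
        for (xs, ys), prob in p_xy.items():
            if prob == 0.0:
                continue
            forward = prob / p_x_yprev[(xs, ys[:i])]        # P(y_i | y^{i-1}, x^i)
            output = p_y[((), ys)] / p_yprev[((), ys[:i])]  # P(y_i | y^{i-1})
            total += prob * log(forward / output)
    return total


EPS = 0.1  # assumed crossover probability of a binary symmetric channel


def bsc(y, x):
    """Transition probability of the assumed binary symmetric channel."""
    return 1.0 - EPS if y == x else EPS


# Two channel uses, i.i.d. uniform inputs, no feedback.
joint_no_feedback = {
    ((x0, x1), (y0, y1)): 0.25 * bsc(y0, x0) * bsc(y1, x1)
    for x0, x1, y0, y1 in product((0, 1), repeat=4)
}
# Two channel uses, uniform X_0, and feedback X_1 = Y_0.
joint_feedback = {
    ((x0, y0), (y0, y1)): 0.5 * bsc(y0, x0) * bsc(y1, y0)
    for x0, y0, y1 in product((0, 1), repeat=3)
}

di_no_feedback = directed_information(joint_no_feedback)
di_feedback = directed_information(joint_feedback)
```

In the feedback example the second term of (I.1) vanishes, because $Y_1$ depends on the inputs only through $X_1 = Y_0$, so conditioning on $Y_0$ already accounts for it.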

In information theory, directed information (I.1) or its variants are used to characterize capacity
of channels with memory and feedback [4]-[11], lossy data compression of sequential codes [ ],
lossy data compression with feedforward information at the decoder [12], lossy data compression
of block codes, and capacity of networks, such as the two-way channel and the multiple access
channel [3], [13]. Moreover, directed information is also utilized in a variety of problems
subject to causality constraints, such as gambling, portfolio theory, data compression and
hypothesis testing [14], and in biology as an alternative to Granger's measure of causality [15]-[17].
Some of the above references derive coding theorems under any of the assumptions: (a)
stationary ergodic processes $\{(X_i, Y_i) : i = 0,1,\ldots\}$, (b) Dobrushin's stability of the information
density $\log \frac{\otimes_{i=0}^{n} P_{Y_i|Y^{i-1},X^i}(dy_i|y^{i-1},x^i)}{\otimes_{i=0}^{n} P_{Y_i|Y^{i-1}}(dy_i|y^{i-1})}$, (c) generalization using the information spectrum
methods [ ]. Hence, relations between $I(X^n \to Y^n)$ or its information spectrum and the optimal
rates theoretically achievable are established in an anthology of problems of information theory.

Directed information as initially introduced by Marko [ ] is obtained via a decomposition of
Shannon's self-mutual information into two directional parts, and then taking their expectations.
Although directed information is obtained from mutual information, its functional and
topological properties are not well understood [ ], compared to those of mutual information.
Specific functional properties of mutual information expressed as a functional $I(X^n; Y^n) =
I_{X^n;Y^n}(P_{X^n}, P_{Y^n|X^n})$ of the two distributions $\{P_{X^n}, P_{Y^n|X^n}\}$, such as convexity and concavity,
and topological properties such as lower semicontinuity (with respect to the topology of weak
convergence of probability measures), at first glance do not translate into analogous properties
for directed information $I(X^n \to Y^n) = I_{X^n \to Y^n}(P_{X_i|X^{i-1},Y^{i-1}}, P_{Y_i|Y^{i-1},X^i} : i = 0,1,\ldots,n)$,
as a functional of the two distributions $\{P_{X_i|X^{i-1},Y^{i-1}}, P_{Y_i|Y^{i-1},X^i} : i = 0,1,\ldots,n\}$. Similarly,
it is not obvious whether the well-known variational equalities of mutual information
utilized in the Blahut-Arimoto algorithm [19] can be extended to directed information. These
properties, together with compactness of subsets of the sets of the conditional distributions
$\{P_{X_i|X^{i-1},Y^{i-1}}(\cdot|\cdot,\cdot) : i = 0,1,\ldots,n\}$ and $\{P_{Y_i|Y^{i-1},X^i}(\cdot|\cdot,\cdot) : i = 0,1,\ldots,n\}$, are fundamental
in addressing extremum problems of directed information related to channel capacity, sequential
and nonanticipative rate distortion, and their generalizations to networks, for abstract (e.g.,
continuous) alphabets. In fact, as shown in [20], even the analysis of single letter mutual information
defined on abstract alphabets becomes very technical, and hence the same is expected for directed
information.

The main objective of this paper is to determine whether the functional and topological properties 
of mutual information can be extended to corresponding properties of directed information 
defined on abstract Polish spaces (complete separable metric spaces), and to provide appropriate 
conditions for these extensions to hold. The main results are summarized below. 
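R1) Introduce an equivalent directed information definition expressed via information divergence
$D(\cdot\|\cdot)$, as a functional of two consistent families of conditional distributions $P(\cdot|y)$ on
$\mathcal{X}^{\mathbb{N}} = \times_{i=0}^{\infty} \mathcal{X}_i$ for $y = (y_0, y_1, \ldots) \in \mathcal{Y}^{\mathbb{N}} = \times_{i=0}^{\infty} \mathcal{Y}_i$, and $Q(\cdot|x)$ on $\mathcal{Y}^{\mathbb{N}}$ for $x \in \mathcal{X}^{\mathbb{N}}$, which
uniquely define $\{P_{X_i|X^{i-1},Y^{i-1}}(\cdot|\cdot,\cdot) : i = 0,1,\ldots\}$ and $\{P_{Y_i|Y^{i-1},X^i}(\cdot|\cdot,\cdot) : i = 0,1,\ldots\}$,
respectively, and vice-versa, and their $(n+1)$-fold convolutional measures $\overleftarrow{P}_{0,n}(dx^n|y^{n-1}) = \ldots$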
R2) Show convexity of the consistent families of the conditional distributions $P(\cdot|y)$, $Q(\cdot|x)$ as
subsets of the set of regular conditional distributions;
R3) Show convexity and concavity of directed information as a functional with respect to the
consistent families of conditional distributions $Q(\cdot|x)$ and $P(\cdot|y)$, respectively;
R4) Show weak compactness of the consistent families of conditional distributions $P(\cdot|y)$ and
$Q(\cdot|x)$, and of their marginals and joint distribution;
R5) Show lower semicontinuity of directed information as a functional of the consistent families
of the conditional distributions $P(\cdot|y)$ and $Q(\cdot|x)$, and continuity of directed information
as a functional of the family $P(\cdot|y)$;
R6) Express directed information in terms of variational equalities involving minimization and
maximization operations over consistent families of conditional distributions;
R7) Illustrate that the functional and topological properties obtained in R1)-R6) extend naturally
to three sequences of RV's $X^n \in \mathcal{X}_{0,n}$, $Y^n \in \mathcal{Y}_{0,n}$, $Z^n \in \mathcal{Z}_{0,n}$, or more, which cover directed
information measures for networks, and problems with side information.

The above functional and topological properties are shown by invoking the topology of weak
convergence of probability measures on Polish spaces and Prohorov's theorems. Some of the
results described above are obtained by utilizing some analogies between communication channels
with memory and feedback, and stochastic control in which the control process and the controlled
process are sequences of conditional distributions, which are analogous to
$\{P_{X_i|X^{i-1},Y^{i-1}}(\cdot|\cdot,\cdot) : i = 0,1,\ldots\}$ and $\{P_{Y_i|Y^{i-1},X^i}(\cdot|\cdot,\cdot) : i = 0,1,\ldots\}$, respectively [21], [22].

From the practical point of view, there are many potential applications of the functional and
topological properties of directed information derived in this paper. Below we list some of them.
The concavity and convexity properties are important in deriving tight bounds for converse
coding theorems, and in identifying properties of extremum problems involving feedback capacity
[3], [23] and sequential and nonanticipative lossy data compression for point-to-point [24] and
network communication [25], [26]. The semicontinuity and continuity of directed information,
and the compactness of the consistent families of distributions $P(\cdot|y)$ and $Q(\cdot|x)$, are crucial
when addressing questions of existence of extremum solutions to problems involving feedback
capacity, sequential and nonanticipative lossy data compression, computations of extremum
solutions and their properties, and derivations of coding theorems on abstract alphabets. The
point to be made is that R1)-R5) are fundamental properties in any extremum problem involving
directed information. The variational equalities are important in generalizing the Blahut-Arimoto
computation schemes to information measures involving directed information.
For the sake of clarity, we discuss one application of the results. Consider the extremum problem
of channel capacity with memory and feedback. Under the assumption of stationary ergodic
processes $\{(X_i, Y_i) : i = 0, 1, \ldots\}$ or Dobrushin's information stability, the operational definition
of capacity is given by the following extremum problem [8].

$$C(P) = \liminf_{n \to \infty} \sup_{\{P_{X_i|X^{i-1},Y^{i-1}}(\cdot|\cdot,\cdot) : i = 0,1,\ldots,n\} \in \mathcal{P}_{0,n}(P)} \frac{1}{n+1} I(X^n \to Y^n), \tag{I.4}$$

where $\mathcal{P}_{0,n}(P)$ denotes the power constraint set. The task of showing existence of a measure
$\{P_{X_i|X^{i-1},Y^{i-1}}(\cdot|\cdot,\cdot) : i = 0, 1, \ldots, n\} \in \mathcal{P}_{0,n}(P)$ which achieves the supremum in (I.4) is not
easy. The main difficulty arises from the fact that $I(X^n \to Y^n)$ is a functional of the two
sequences of distributions $\{P_{X_i|X^{i-1},Y^{i-1}}(\cdot|\cdot,\cdot), P_{Y_i|Y^{i-1},X^i}(\cdot|\cdot,\cdot) : i = 0,1,\ldots,n\}$, unlike mutual
information $I(X^n; Y^n) = I_{X^n;Y^n}(P_{X^n}, P_{Y^n|X^n})$, which inherits most of its properties from those
of relative entropy between two distributions. Even for problems involving mutual information,
Csiszár [20], [27] introduced elaborate arguments to address such extremum problems on abstract
alphabets. However, by utilizing R1)-R5) it is possible to show existence of such conditional
distributions not only for (I.4), but also for the sequential and nonanticipative rate distortion
functions, obtain generalizations involving three or more random processes, and identify the
properties of the extremum solutions.

The rest of the paper is structured as follows. Section II introduces two equivalent definitions
of nonanticipative channels on abstract spaces (R1)). Section III derives the functional and
topological properties of directed information (R2)-R5)). Section IV derives variational
equalities of directed information (R6)). Finally, Section V gives concluding remarks and identifies
additional problems of future interest. Appendix VI gives a brief discussion of the background
material utilized to derive the main results and includes some of the proofs.

II. Equivalent Nonanticipative Channels on Abstract Spaces 

In this section, our aim is to establish two equivalent definitions of the sequence of conditional
distributions or basic processes, which define any probabilistic channel with nonanticipative
(causal) feedback and relate causally the input-output behavior of the channel. This formulation
is necessary to establish the results stated under R1)-R7). The first definition of conditional
distributions is the usual one found in many papers, e.g., [3], [4], [7]-[10]; it is described
via a family of functions which are regular conditional distributions so that their convolution
product is a family of regular conditional distributions on product alphabets (see Fig. II.1, (a)).
The second definition is described via a family of regular conditional distributions defined
on product alphabets, which satisfy a certain consistency condition (see Fig. II.1, (b)). The
second definition is often utilized in the stochastic control literature, when both the control and
controlled processes are conditional probability distributions [21], [22]. Indeed, the analogy is
that $\{X_i : i = 0, 1, \ldots\}$ is the controlled process and $\{Y_i : i = 0, 1, \ldots\}$ is the control process.

Fig. II.1. Equivalent Representations of Feedback/Feedforward Channels. (a) Sequence of feedback
and feedforward channels $\{P_{X_i|X^{i-1},Y^{i-1}}, P_{Y_i|Y^{i-1},X^i} : i = 0, 1, \ldots, n\}$. (b) Consistent families of
feedback and feedforward channels $\{\overleftarrow{P}_{X^n|Y^{n-1}}, \overrightarrow{P}_{Y^n|X^n} : n \in \mathbb{N}\}$.

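The second definition is appealing from the point of view of expressing the information density
$i(x^n, y^n) = \log \frac{\otimes_{i=0}^{n} P_{Y_i|Y^{i-1},X^i}(dy_i|y^{i-1},x^i)}{\otimes_{i=0}^{n} P_{Y_i|Y^{i-1}}(dy_i|y^{i-1})}$ associated with directed information $I(X^n \to Y^n)$,
in terms of two consistent families of conditional distributions, namely, $Q(\cdot|x)$ on $\mathcal{Y}^{\mathbb{N}}$ given
$x = (x_0, x_1, \ldots) \in \mathcal{X}^{\mathbb{N}}$, and $P(\cdot|y)$ on $\mathcal{X}^{\mathbb{N}}$ given $y = (y_0, y_1, \ldots) \in \mathcal{Y}^{\mathbb{N}}$, which uniquely define
$\{P_{Y_i|Y^{i-1},X^i}(\cdot|\cdot,\cdot) : i = 0,1,\ldots\}$ and $\{P_{X_i|X^{i-1},Y^{i-1}}(\cdot|\cdot,\cdot) : i = 0,1,\ldots\}$, respectively, such that
$i(x^n, y^n) = \log \frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{\nu^{P \otimes Q}_{0,n}(dy^n)}$-a.s., where $\nu^{P \otimes Q}_{0,n}(\cdot)$ is the marginal distribution¹ on $\times_{i=0}^{n} \mathcal{Y}_i$,
obtained from $P(\cdot|y)$ and $Q(\cdot|x)$. Once the conditions on the abstract spaces $\{(\mathcal{X}_i, \mathcal{Y}_i) : i =
0,1,\ldots\}$ are identified, and the consistency conditions are introduced, then it can be shown that
$i(x^n, y^n)$ has another version given by $i(x^n, y^n) = \log \frac{\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \otimes \overrightarrow{Q}_{0,n}(\cdot|x^n)}{\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \otimes \nu_{0,n}(\cdot)}(x^n, y^n)$-a.s.,
where $\otimes$ denotes convolution of measures. Consequently, directed information can be expressed
in terms of Kullback-Leibler distance $\mathbb{D}(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n} \| \overleftarrow{P}_{0,n} \otimes \nu_{0,n})$. Results R1)-R6) are obtained by utilizing
the second equivalent definition of conditional distributions.

¹ In the rest of the paper we write $\nu$ instead of $\nu^{P \otimes Q}$, omitting its explicit dependence on $P(\cdot|y)$ and $Q(\cdot|x)$.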
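
As an illustration of the Kullback-Leibler form of directed information mentioned above, the following Python sketch evaluates, for finite alphabets, the divergence between the joint distribution built from the feedback and feedforward kernels and the reference measure obtained by replacing the feedforward part with the output marginal. This is a sketch under stated assumptions rather than the paper's construction: the kernels are passed as hypothetical Python callables `p_kernels[i](x_i, x_prev, y_prev)` and `q_kernels[i](y_i, y_prev, x_upto_i)`, and the function name `directed_information_kl` is illustrative.

```python
from itertools import product
from math import log


def directed_information_kl(p_kernels, q_kernels, x_alphabet, y_alphabet):
    """KL divergence (in nats) between the joint pmf and (feedback part) x (output marginal)."""
    horizon = len(p_kernels)  # number of time steps, n + 1
    seqs_x = list(product(x_alphabet, repeat=horizon))
    seqs_y = list(product(y_alphabet, repeat=horizon))
    feedback, forward = {}, {}
    for xs in seqs_x:
        for ys in seqs_y:
            pb = qf = 1.0
            for i in range(horizon):
                pb *= p_kernels[i](xs[i], xs[:i], ys[:i])      # p_i(x_i; x^{i-1}, y^{i-1})
                qf *= q_kernels[i](ys[i], ys[:i], xs[:i + 1])  # q_i(y_i; y^{i-1}, x^i)
            feedback[(xs, ys)] = pb
            forward[(xs, ys)] = qf
    # Output marginal nu(y^n), obtained by summing the joint pmf over x^n.
    nu = {ys: sum(feedback[(xs, ys)] * forward[(xs, ys)] for xs in seqs_x)
          for ys in seqs_y}
    total = 0.0
    for key, pb in feedback.items():
        joint = pb * forward[key]
        reference = pb * nu[key[1]]
        if joint > 0.0:
            total += joint * log(joint / reference)
    return total


# Assumed example: uniform X_0, X_1 copies Y_0 with probability 0.9,
# and a memoryless binary symmetric channel with crossover probability 0.2.
p_kernels = [
    lambda x, x_prev, y_prev: 0.5,
    lambda x, x_prev, y_prev: 0.9 if x == y_prev[-1] else 0.1,
]
q_kernels = [lambda y, y_prev, x_upto: 0.8 if y == x_upto[-1] else 0.2] * 2

di_feedback = directed_information_kl(p_kernels, q_kernels, (0, 1), (0, 1))
```

The feedback factor cancels inside the logarithm, so the quantity computed is the expectation of the logarithm of the feedforward product divided by the output marginal, which matches the information density described in the text for this finite case.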

Notations and Preliminaries. 

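Denote the set of non-negative integers by $\mathbb{N} = \{0,1,2,\ldots\}$, and its restriction to a finite
set by $\mathbb{N}^n = \{0, 1, 2, \ldots, n\}$. Introduce two sequences of spaces $\{(\mathcal{X}_n, \mathcal{B}(\mathcal{X}_n)) : n \in \mathbb{N}\}$ and
$\{(\mathcal{Y}_n, \mathcal{B}(\mathcal{Y}_n)) : n \in \mathbb{N}\}$, called basic measurable spaces, where $\mathcal{X}_n, \mathcal{Y}_n$, $n \in \mathbb{N}$, are topological
spaces, and $\mathcal{B}(\mathcal{X}_n)$ and $\mathcal{B}(\mathcal{Y}_n)$ are Borel $\sigma$-algebras of subsets of $\mathcal{X}_n$ and $\mathcal{Y}_n$, respectively.

For each $n \in \mathbb{N}$ define the product spaces $\mathcal{X}_{0,n} = \times_{i=0}^{n} \mathcal{X}_i$ and $\mathcal{Y}_{0,n} = \times_{i=0}^{n} \mathcal{Y}_i$.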

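The basic measurable spaces are connected to a random experiment consisting of a countable
chain of trials with a time ordering as follows. For each $n \in \mathbb{N}$, let $\mathcal{X}_n$ and $\mathcal{Y}_n$ be the spaces of all
possible outcomes at time $n \in \mathbb{N}$. Having performed the trials up to and including the $n$th time,
the time ordering of outcomes is $x_0, y_0, x_1, y_1, \ldots, x_n, y_n$ in $\mathcal{X}_0, \mathcal{Y}_0, \mathcal{X}_1, \mathcal{Y}_1, \ldots, \mathcal{X}_n, \mathcal{Y}_n$,
respectively. The probability distributions governing the next trial at time $(n+1)$ are
$p_{n+1}(A_{n+1}; x_0, \ldots, x_n, y_0, \ldots, y_n)$ and $q_{n+1}(B_{n+1}; y_0, \ldots, y_n, x_0, \ldots, x_{n+1})$, $A_{n+1} \in \mathcal{B}(\mathcal{X}_{n+1})$, $B_{n+1} \in \mathcal{B}(\mathcal{Y}_{n+1})$.

Hence, each possible outcome of the experiment is a sequence $\omega = (x_0, y_0, x_1, y_1, \ldots)$ with
$x_n \in \mathcal{X}_n$, $y_n \in \mathcal{Y}_n$ for each $n \in \mathbb{N}$.

Consequently, define the sample space $\Omega$ and the algebra of all experiments by

$$(\Omega, \mathcal{F}) = \Big( \times_{n \in \mathbb{N}} (\mathcal{X}_n \times \mathcal{Y}_n), \odot_{n \in \mathbb{N}} \big(\mathcal{B}(\mathcal{X}_n) \odot \mathcal{B}(\mathcal{Y}_n)\big) \Big).$$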

Associated with the basic measurable spaces there are two basic sequences of Random Variables
(RV's) $\{X_n : n \in \mathbb{N}\}$ and $\{Y_n : n \in \mathbb{N}\}$, such that for each $n \in \mathbb{N}$, they take values $X_n \in \mathcal{X}_n$
and $Y_n \in \mathcal{Y}_n$. These are introduced as follows.
Let $X_0, Y_0, X_1, Y_1, \ldots$ be the coordinate RV's. For each $n \in \mathbb{N}$

$$X_n(\omega) = x_n, \quad Y_n(\omega) = y_n \quad \text{if } \omega = (x_0, y_0, x_1, y_1, \ldots).$$

Clearly, $X_n : (\Omega, \mathcal{F}) \mapsto (\mathcal{X}_n, \mathcal{B}(\mathcal{X}_n))$, $Y_n : (\Omega, \mathcal{F}) \mapsto (\mathcal{Y}_n, \mathcal{B}(\mathcal{Y}_n))$, and for each outcome $\omega \in \Omega$
of the experiment, $X_n(\omega)$, $Y_n(\omega)$ are the results of the $n$th time. Similarly, $X^n = \{X_0, \ldots, X_n\}$
and $Y^n = \{Y_0, \ldots, Y_n\}$ denote the result of the trials up to and including the $n$th time; they
are RV's taking values in $(\mathcal{X}_{0,n}, \mathcal{B}(\mathcal{X}_{0,n}))$ and $(\mathcal{Y}_{0,n}, \mathcal{B}(\mathcal{Y}_{0,n}))$, respectively. The objective is to
construct a measure $\mathbb{P}$ on $(\Omega, \mathcal{F})$ consistent with the data (e.g., measurable spaces and conditional
distributions).

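For every $n \in \mathbb{N}$, define the $\sigma$-algebras generated by $\{X_0, X_1, \ldots, X_n\}$ and $\{Y_0, Y_1, \ldots, Y_n\}$ by

$$\mathcal{F}(X^n) = \sigma\{X_0, X_1, \ldots, X_n\}, \quad \mathcal{F}(Y^n) = \sigma\{Y_0, Y_1, \ldots, Y_n\}.$$

Then every event $H \in \mathcal{F}(X^n)$ has the form

$$H = \{(X_0, X_1, \ldots, X_n) \in A\} = A \times \mathcal{X}_{n+1} \times \ldots, \quad A \in \mathcal{B}(\mathcal{X}_{0,n}),$$

and is called a cylinder set with base $A \in \mathcal{B}(\mathcal{X}_{0,n})$. Similarly, for an event $J \in \mathcal{F}(Y^n)$,

$$J = \{(Y_0, Y_1, \ldots, Y_n) \in B\} = B \times \mathcal{Y}_{n+1} \times \ldots, \quad B \in \mathcal{B}(\mathcal{Y}_{0,n}),$$

and $J$ is a cylinder set with base $B \in \mathcal{B}(\mathcal{Y}_{0,n})$.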

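Points in the Cartesian countable product spaces $\mathcal{X}^{\mathbb{N}} = \times_{n \in \mathbb{N}} \mathcal{X}_n$, $\mathcal{Y}^{\mathbb{N}} = \times_{n \in \mathbb{N}} \mathcal{Y}_n$ are denoted
by $x = \{x_0, x_1, \ldots\} \in \mathcal{X}^{\mathbb{N}}$, $y = \{y_0, y_1, \ldots\} \in \mathcal{Y}^{\mathbb{N}}$, respectively. Similarly, for $n \in \mathbb{N}$,
points in $\mathcal{X}_{0,n} = \times_{i=0}^{n} \mathcal{X}_i$, $\mathcal{Y}_{0,n} = \times_{i=0}^{n} \mathcal{Y}_i$ are denoted by $x^n = \{x_0, x_1, \ldots, x_n\} \in \mathcal{X}_{0,n}$,
$y^n = \{y_0, y_1, \ldots, y_n\} \in \mathcal{Y}_{0,n}$, respectively.
Let $\mathcal{B}(\mathcal{X}^{\mathbb{N}})$ and $\mathcal{B}(\mathcal{Y}^{\mathbb{N}})$ denote the $\sigma$-algebras in $\mathcal{X}^{\mathbb{N}}$ and $\mathcal{Y}^{\mathbb{N}}$, respectively, generated by cylinder
sets (e.g., $\mathcal{B}(\mathcal{X}^{\mathbb{N}})$ is the smallest Borel $\sigma$-algebra containing all cylinder sets $\{x = (x_0, x_1, \ldots) \in
\mathcal{X}^{\mathbb{N}} : x_0 \in A_0, x_1 \in A_1, \ldots, x_n \in A_n\}$, $A_i \in \mathcal{B}(\mathcal{X}_i)$, $0 \le i \le n$, $n \ge 0$). The Borel
$\sigma$-algebra $\mathcal{B}(\mathcal{X}^{\mathbb{N}})$ is denoted by $\odot_{i \in \mathbb{N}} \mathcal{B}(\mathcal{X}_i)$. Hence, $\mathcal{B}(\mathcal{X}_{0,n})$ and $\mathcal{B}(\mathcal{Y}_{0,n})$ denote the $\sigma$-algebras
of cylinder sets in $\mathcal{X}^{\mathbb{N}}$ and $\mathcal{Y}^{\mathbb{N}}$, respectively, with bases over $A_i \in \mathcal{B}(\mathcal{X}_i)$, $i = 0,1,\ldots,n$, and
$B_i \in \mathcal{B}(\mathcal{Y}_i)$, $i = 0,1,\ldots,n$, respectively.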

Backward or Feedback Channel. 

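Suppose for each $n \in \mathbb{N}$, the conditional distribution of the RV $X_n \in \mathcal{X}_n$ is determined provided
the values of the basic processes $X^{n-1} = x^{n-1} \in \mathcal{X}_{0,n-1}$ and $Y^{n-1} = y^{n-1} \in \mathcal{Y}_{0,n-1}$ are
known, and let $\{p_n(dx_n; x^{n-1}, y^{n-1}) : n \in \mathbb{N}\}$ denote this family of distributions. Here, it is
assumed that $\sigma\{X^{-1}, Y^{-1}\} = \{\emptyset, \Omega\}$, hence $p_0(dx_0; x^{-1}, y^{-1}) = p_0(dx_0)$. The functions $p_n(\cdot;\cdot,\cdot)$
are candidates of distributions of the sequence of RV's $\{X_n : n \in \mathbb{N}\}$ on $\{(\mathcal{X}_n, \mathcal{B}(\mathcal{X}_n)) : n \in \mathbb{N}\}$
if and only if the following conditions hold.

i) For every $n \in \mathbb{N}$, $p_n(\cdot; x^{n-1}, y^{n-1})$ is a probability measure on $\mathcal{B}(\mathcal{X}_n)$;

ii) For every $n \in \mathbb{N}$, $p_n(A_n; x^{n-1}, y^{n-1})$ is a $\odot_{i=0}^{n-1} \big(\mathcal{B}(\mathcal{X}_i) \odot \mathcal{B}(\mathcal{Y}_i)\big)$-measurable function of
$x^{n-1} \in \mathcal{X}_{0,n-1}$, $y^{n-1} \in \mathcal{Y}_{0,n-1}$, for every $A_n \in \mathcal{B}(\mathcal{X}_n)$.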

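Given the collection of functions $\{p_n(dx_n; x^{n-1}, y^{n-1}) : n \in \mathbb{N}\}$ satisfying conditions i), ii), one
can construct a family of distributions on the product space $(\mathcal{X}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}})) = (\times_{i \in \mathbb{N}} \mathcal{X}_i, \odot_{i \in \mathbb{N}} \mathcal{B}(\mathcal{X}_i))$
as follows.

Let $C \in \mathcal{B}(\mathcal{X}_{0,n})$ be a cylinder set of the form

$$C = \{x \in \mathcal{X}^{\mathbb{N}} : x_0 \in C_0, x_1 \in C_1, \ldots, x_n \in C_n\}, \quad C_i \in \mathcal{B}(\mathcal{X}_i), \quad 0 \le i \le n.$$

Define a family of measures $P(\cdot|y)$ on $\mathcal{B}(\mathcal{X}^{\mathbb{N}})$ by

$$P(C|y) = \int_{C_0} p_0(dx_0) \int_{C_1} p_1(dx_1; x^0, y^0) \ldots \int_{C_n} p_n(dx_n; x^{n-1}, y^{n-1}) \tag{II.1}$$

$$= \overleftarrow{P}_{0,n}(C_{0,n}|y^{n-1}), \quad C_{0,n} = \times_{i=0}^{n} C_i. \tag{II.2}$$

The notation $\overleftarrow{P}_{0,n}(C_{0,n}|y^{n-1})$ is used to denote the causal conditioning dependence of the
measure $P(\cdot|y)$ defined on cylinder sets $C \in \mathcal{B}(\mathcal{X}_{0,n})$, for any $n \in \mathbb{N}$. The right hand side
of (II.1) uniquely defines a measure on $(\mathcal{X}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}))$. Moreover, for each $n \in \mathbb{N}$ the family of
measures $P(\cdot|y)$ satisfies the following property (inherited from condition ii)): for $E \in \mathcal{B}(\mathcal{X}^{\mathbb{N}})$,
$P(E|y)$ is $\mathcal{B}(\mathcal{Y}^{\mathbb{N}})$-measurable, and for $E \in \mathcal{B}(\mathcal{X}_{0,n})$, $P(E|y)$ is $\mathcal{B}(\mathcal{Y}_{0,n-1})$-measurable.
Thus, if conditions i) and ii) hold then for each $y \in \mathcal{Y}^{\mathbb{N}}$, the right hand side of (II.1) defines a
consistent family of finite-dimensional distributions, and hence there exists a unique measure on
$(\mathcal{X}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}))$, for which $p_n(dx_n; x^{n-1}, y^{n-1})$ is obtained. This leads to the first definition of a
feedback channel, as a family of functions $p_n(dx_n; x^{n-1}, y^{n-1})$ satisfying conditions i) and ii).
This definition is used extensively by many authors [3], [4], [7]-[10].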
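
For finite alphabets, the construction (II.1) reduces to a product of kernels, and the causal dependence on $y$ can be checked directly. The following Python sketch is a minimal illustration under stated assumptions: the kernels are hypothetical callables `p_i(x_i, x_prev, y_prev)` chosen only for the example, and `causal_conditional` is an illustrative name. It shows that the resulting distribution on $x^n$ depends on $y$ only through $y^{n-1}$, and that summing out the last coordinate recovers the distribution built from the first $n$ kernels (finite-dimensional consistency).

```python
from itertools import product


def causal_conditional(kernels, x_alphabet, y_seq):
    """pmf of x^n given y, built as prod_i p_i(x_i; x^{i-1}, y^{i-1})."""
    horizon = len(kernels)  # number of time steps, n + 1
    pmf = {}
    for xs in product(x_alphabet, repeat=horizon):
        prob = 1.0
        for i, p_i in enumerate(kernels):
            # Only the strictly past outputs y^{i-1} are passed to the kernel.
            prob *= p_i(xs[i], xs[:i], y_seq[:i])
        pmf[xs] = prob
    return pmf


# Hypothetical binary kernels for three time steps (n = 2).
kernels = [
    lambda x, x_prev, y_prev: 0.5,
    lambda x, x_prev, y_prev: 0.9 if x == y_prev[-1] else 0.1,
    lambda x, x_prev, y_prev: 0.7 if x == (x_prev[-1] ^ y_prev[-1]) else 0.3,
]

# Two output sequences that agree on y^{n-1} = (0, 1) and differ only in y_n.
pmf_a = causal_conditional(kernels, (0, 1), (0, 1, 0))
pmf_b = causal_conditional(kernels, (0, 1), (0, 1, 1))
same_for_both = pmf_a == pmf_b

# Finite-dimensional consistency: summing out x_n gives the n-fold product.
pmf_short = causal_conditional(kernels[:2], (0, 1), (0, 1, 0))
max_gap = max(
    abs(sum(pmf_a[xs + (x_last,)] for x_last in (0, 1)) - pmf_short[xs])
    for xs in pmf_short
)
```

Here `same_for_both` reflects the measurability property stated after (II.2) in this finite setting, since the value of $y_n$ is never read by any kernel.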

An alternative, equivalent definition of a feedback channel is established as follows. Consider a
family of measures $P(\cdot|y)$ on $(\mathcal{X}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}))$ satisfying the following consistency condition.

C1: If $E \in \mathcal{B}(\mathcal{X}_{0,n})$ then $P(E|y)$ is a $\mathcal{B}(\mathcal{Y}_{0,n-1})$-measurable function of $y \in \mathcal{Y}^{\mathbb{N}}$.

Clearly, if conditions i) and ii) are satisfied then C1 holds for the family of measures $P(\cdot|y)$
defined via the right hand side of (II.1). The question we address next is whether for any family
of measures $P(\cdot|y)$ on $(\mathcal{X}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}))$ satisfying consistency condition C1, one can construct a
collection of functions $\{p_n(dx_n; x^{n-1}, y^{n-1}) : n \in \mathbb{N}\}$ satisfying conditions i) and ii), which are
connected to $P(\cdot|\cdot)$ via relation (II.1). To illustrate this point, let $A^{(n)} = \{x \in \mathcal{X}^{\mathbb{N}} : x_n \in A\}$,
$A \in \mathcal{B}(\mathcal{X}_n)$, and let $P(A^{(n)}|y|\mathcal{B}(\mathcal{X}_{0,n-1}))$ denote the conditional probability of $A^{(n)}$ with respect
to $\mathcal{B}(\mathcal{X}_{0,n-1})$ calculated on the probability space $(\mathcal{X}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}), P(\cdot|y))$. Then

$$P(A^{(n)}|y|\mathcal{B}(\mathcal{X}_{0,n-1})) = p_n(A; x^{n-1}, y^{n-1}), \quad A^{(n)} \in \mathcal{B}(\mathcal{X}_{0,n}), \tag{II.3}$$

for $P(\cdot|y)$-almost all $x \in \mathcal{X}^{\mathbb{N}}$. Clearly, the function on the right hand side of (II.3), $p_n(A; x^{n-1}, y^{n-1})$,
is $\mathcal{B}(\mathcal{X}_{0,n-1})$-measurable for a fixed $A \in \mathcal{B}(\mathcal{X}_n)$ and $y^{n-1} \in \mathcal{Y}_{0,n-1}$, but it cannot be claimed
that $p_n(\cdot; x^{n-1}, y^{n-1})$ is a probability measure on $\mathcal{X}_n$. However, under the general assumption
that $\{(\mathcal{X}_n, \mathcal{B}(\mathcal{X}_n)) : n \in \mathbb{N}\}$ are complete separable metric spaces (Polish spaces), with $\mathcal{B}(\mathcal{X}_n)$
the $\sigma$-algebra of Borel sets, it is shown in [21] (Appendix VI-A, Theorem VI.16) that the right
hand side of (II.3) represents a version of conditional probability (a.s.) such that condition i)
holds as well. Therefore, to establish the second equivalent definition, introduce the following
condition on the alphabet spaces.

iii) $\{\mathcal{X}_n : n \in \mathbb{N}\}$ are complete separable metric spaces and $\{\mathcal{B}(\mathcal{X}_n) : n \in \mathbb{N}\}$ are the $\sigma$-algebras
of Borel sets.

By Theorem VI.16, Appendix VI-A, if condition iii) holds, then for any family of measures
$P(\cdot|y)$ satisfying C1 one can construct a collection of versions of conditional distributions
$\{p_n(dx_n; x^{n-1}, y^{n-1}) : n \in \mathbb{N}\}$ satisfying conditions i) and ii) which are connected with $P(\cdot|y)$
via relation (II.1), and hence the following conclusion. When $\{\mathcal{X}_n : n \in \mathbb{N}\}$ are Polish spaces
with $\{\mathcal{B}(\mathcal{X}_n) : n \in \mathbb{N}\}$ the $\sigma$-algebras of Borel sets, there are two equivalent definitions
of a feedback channel. The first definition is the usual one given by a family of functions
$p_n(dx_n; x^{n-1}, y^{n-1})$ satisfying conditions i) and ii). The second definition is given by a family
of measures $P(\cdot|y)$ on $(\mathcal{X}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}))$ depending parametrically on $y \in \mathcal{Y}^{\mathbb{N}}$ and satisfying the
consistency condition C1. Although the family of measures $P(\cdot|y)$ on $(\mathcal{X}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}))$ are finitely
additive probability measures, by Kolmogorov's extension theorem [28], the completeness of
$\{\mathcal{X}_n : n \in \mathbb{N}\}$ guarantees the existence of countably additive probability measures $P(\cdot|y)$ on
$(\mathcal{X}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}))$, whose marginal on each $\mathcal{X}_{0,n}$ is $\overleftarrow{P}_{0,n}(\cdot|y^{n-1})$.

The second equivalent definition of a feedback channel, together with an analogous equivalent
definition for the forward channel, will be used throughout the paper to derive the
functional and topological properties of directed information.

Feedforward Channel. 
The previous methodology is repeated to obtain two equivalent definitions for the forward
channel as well. Suppose for each $n \in \mathbb{N}$, the conditional distribution of the RV $Y_n \in \mathcal{Y}_n$
is determined provided the values of the basic processes $Y^{n-1} = y^{n-1} \in \mathcal{Y}_{0,n-1}$ and $X^n = x^n \in \mathcal{X}_{0,n}$
are known, and let $\{q_n(dy_n; y^{n-1}, x^n) : n \in \mathbb{N}\}$ denote this family of distributions. Similarly
define $q_0(dy_0; y^{-1}, x^0) = q_0(dy_0; x_0)$. The collection of functions $\{q_n(dy_n; y^{n-1}, x^n) : n \in \mathbb{N}\}$
satisfies the following conditions.

iv) For every $n \in \mathbb{N}$, $q_n(\cdot; y^{n-1}, x^n)$ is a probability measure on $\mathcal{B}(\mathcal{Y}_n)$;

v) For every $n \in \mathbb{N}$, $q_n(B_n; y^{n-1}, x^n)$ is a $\odot_{i=0}^{n-1} \big(\mathcal{B}(\mathcal{Y}_i) \odot \mathcal{B}(\mathcal{X}_i)\big) \odot \mathcal{B}(\mathcal{X}_n)$-measurable function
of $x^n \in \mathcal{X}_{0,n}$, $y^{n-1} \in \mathcal{Y}_{0,n-1}$, for every $B_n \in \mathcal{B}(\mathcal{Y}_n)$.

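Similarly as before, using the collection of functions $\{q_n(\cdot;\cdot,\cdot) : n \in \mathbb{N}\}$ one can construct a
family of measures $Q(\cdot|x)$ on $(\mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{Y}^{\mathbb{N}}))$ which depend parametrically on $x \in \mathcal{X}^{\mathbb{N}}$, as follows.
Consider a cylinder set $D \in \mathcal{B}(\mathcal{Y}_{0,n})$ of the form

$$D = \{y \in \mathcal{Y}^{\mathbb{N}} : y_0 \in D_0, y_1 \in D_1, \ldots, y_n \in D_n\}, \quad D_i \in \mathcal{B}(\mathcal{Y}_i), \quad 0 \le i \le n.$$

Define a family of measures on $\mathcal{B}(\mathcal{Y}^{\mathbb{N}})$ by

$$Q(D|x) = \int_{D_0} q_0(dy_0; x_0) \int_{D_1} q_1(dy_1; y_0, x^1) \ldots \int_{D_n} q_n(dy_n; y^{n-1}, x^n) \tag{II.4}$$

$$= \overrightarrow{Q}_{0,n}(D_{0,n}|x^n), \quad D_{0,n} = \times_{i=0}^{n} D_i. \tag{II.5}$$

Since for each $x \in \mathcal{X}^{\mathbb{N}}$ the right hand side of (II.4) defines a consistent family of finite
dimensional distributions, there exists a unique measure on $(\mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{Y}^{\mathbb{N}}))$ for which the family
of distributions $\{q_n(dy_n; y^{n-1}, x^n) : n \in \mathbb{N}\}$ is obtained. Moreover, the family of measures
$Q(\cdot|x)$ satisfies the following consistency condition.

C2: If $F \in \mathcal{B}(\mathcal{Y}_{0,n})$, then $Q(F|x)$ is a $\mathcal{B}(\mathcal{X}_{0,n})$-measurable function of $x \in \mathcal{X}^{\mathbb{N}}$.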

By Theorem VI.16, Appendix VI-A, to obtain another equivalent definition for the forward
channel introduce the following condition on the output alphabet.

vi) $\{\mathcal{Y}_n : n \in \mathbb{N}\}$ are Polish spaces and $\{\mathcal{B}(\mathcal{Y}_n) : n \in \mathbb{N}\}$ are the $\sigma$-algebras of Borel sets.

Then for any family of measures $Q(\cdot|x)$ on $(\mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{Y}^{\mathbb{N}}))$ satisfying consistency condition C2
(under condition vi)), one can construct a collection of functions $\{q_n(\cdot;\cdot,\cdot) : n \in \mathbb{N}\}$ satisfying
conditions iv) and v), which are connected with $Q(\cdot|x)$ via relation (II.4). Therefore, we arrive
at two equivalent definitions for the forward channel as well.

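We conclude this section by constructing the probability space $(\Omega, \mathcal{F}, \mathbb{P})$, as stated earlier, and
the sequence of RV's $\{(X_i, Y_i) : i = 0,1,\ldots,n\}$ defined on it. Given the basic measures $P(\cdot|y)$
on $(\mathcal{X}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}))$ satisfying consistency condition C1 and $Q(\cdot|x)$ on $(\mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{Y}^{\mathbb{N}}))$ satisfying consistency condition
C2, one can construct a sequence of RV's $\{(X_n, Y_n) : n \in \mathbb{N}\}$ or conditional distributions as
follows.

Let $A^{(n)} = \{x : x_n \in A\}$, $A \in \mathcal{B}(\mathcal{X}_n)$ and $B^{(n)} = \{y : y_n \in B\}$, $B \in \mathcal{B}(\mathcal{Y}_n)$. In addition,
let $P(A^{(n)}|y|\mathcal{B}(\mathcal{X}_{0,n-1}))$ denote the conditional probability of $A^{(n)}$ with respect to $\mathcal{B}(\mathcal{X}_{0,n-1})$
calculated on the probability space $(\mathcal{X}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}), P(\cdot|y))$, and $Q(B^{(n)}|x|\mathcal{B}(\mathcal{Y}_{0,n-1}))$ denote the
conditional probability of $B^{(n)}$ with respect to $\mathcal{B}(\mathcal{Y}_{0,n-1})$ calculated on the probability space
$(\mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), Q(\cdot|x))$.

Then by conditioning it follows that

$$\mathbb{P}\{X_n \in A_n | X^{n-1} = x^{n-1}, Y^{n-1} = y^{n-1}\} = P(\{x : x_n \in A_n\}|y|\mathcal{B}(\mathcal{X}_{0,n-1})), \quad A_n \in \mathcal{B}(\mathcal{X}_n)$$

$$= p_n(A_n; x^{n-1}, y^{n-1}) \tag{II.6}$$

$$\mathbb{P}\{Y_n \in B_n | Y^{n-1} = y^{n-1}, X^n = x^n\} = Q(\{y : y_n \in B_n\}|x|\mathcal{B}(\mathcal{Y}_{0,n-1})), \quad B_n \in \mathcal{B}(\mathcal{Y}_n)$$

$$= q_n(B_n; y^{n-1}, x^n) \tag{II.7}$$

for almost all $x \in \mathcal{X}^{\mathbb{N}}$ in measure $P(\cdot|y)$, and for almost all $y \in \mathcal{Y}^{\mathbb{N}}$ in measure $Q(\cdot|x)$. Note
that $p_n(\cdot;\cdot,\cdot) \in \mathcal{Q}(\mathcal{X}_n; \mathcal{X}_{0,n-1}, \mathcal{Y}_{0,n-1})$ and $q_n(\cdot;\cdot,\cdot) \in \mathcal{Q}(\mathcal{Y}_n; \mathcal{Y}_{0,n-1}, \mathcal{X}_{0,n})$ are stochastic kernels
(see Definition VI.15, Appendix VI-A) determined from $P(\cdot|\cdot)$ and $Q(\cdot|\cdot)$, respectively (e.g.,
they are related via (II.1) and (II.4), respectively).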

Consequently, the joint distribution of the RV's $\{(X_i, Y_i) : i \in \mathbb{N}\}$ is defined by

$$\mathbb{P}\{X_0 \in A_0, Y_0 \in B_0, \ldots, X_n \in A_n, Y_n \in B_n\} = \int_{A_0} p_0(dx_0) \int_{B_0} q_0(dy_0; x_0) \ldots \int_{A_n} p_n(dx_n; x^{n-1}, y^{n-1}) \int_{B_n} q_n(dy_n; y^{n-1}, x^n). \tag{II.8}$$

Hence, given the two Polish spaces $\mathcal{X}^{\mathbb{N}}$ and $\mathcal{Y}^{\mathbb{N}}$, for any $P(\cdot|\cdot)$ and $Q(\cdot|\cdot)$ satisfying the
consistency conditions C1, C2, respectively, there exist a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and a sequence
of RV's $\{(X_i, Y_i) : i \in \mathbb{N}\}$ defined on it, whose joint probability distribution is uniquely defined
by (II.8), via $P(\cdot|\cdot)$ and $Q(\cdot|\cdot)$.
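
The interleaved construction (II.8) and the conditioning relation (II.6) can be checked numerically for finite alphabets. The following Python sketch is an illustration only: the kernels are hypothetical callables with the assumed signatures `p_i(x_i, x_prev, y_prev)` and `q_i(y_i, y_prev, x_upto_i)`, and the names `joint_pmf` and `conditional_x1` are not from the paper. It builds the joint distribution for two time steps and then recovers the feedback kernel at time 1 by conditioning.

```python
from itertools import product


def joint_pmf(p_kernels, q_kernels, x_alphabet, y_alphabet):
    """Joint pmf of (x^n, y^n) as the interleaved product of the two kernel families."""
    horizon = len(p_kernels)  # number of time steps, n + 1
    joint = {}
    for xs in product(x_alphabet, repeat=horizon):
        for ys in product(y_alphabet, repeat=horizon):
            prob = 1.0
            for i in range(horizon):
                prob *= p_kernels[i](xs[i], xs[:i], ys[:i])      # p_i(x_i; x^{i-1}, y^{i-1})
                prob *= q_kernels[i](ys[i], ys[:i], xs[:i + 1])  # q_i(y_i; y^{i-1}, x^i)
            joint[(xs, ys)] = prob
    return joint


def conditional_x1(joint, x1, x0, y0):
    """P(X_1 = x1 | X_0 = x0, Y_0 = y0) obtained from the joint pmf by conditioning."""
    num = sum(p for (xs, ys), p in joint.items()
              if xs[0] == x0 and ys[0] == y0 and xs[1] == x1)
    den = sum(p for (xs, ys), p in joint.items()
              if xs[0] == x0 and ys[0] == y0)
    return num / den


# Hypothetical kernels: uniform X_0, X_1 copies Y_0 with probability 0.9,
# and a binary symmetric forward channel with crossover probability 0.2.
p_kernels = [
    lambda x, x_prev, y_prev: 0.5,
    lambda x, x_prev, y_prev: 0.9 if x == y_prev[-1] else 0.1,
]
q_kernels = [lambda y, y_prev, x_upto: 0.8 if y == x_upto[-1] else 0.2] * 2

joint = joint_pmf(p_kernels, q_kernels, (0, 1), (0, 1))
total_mass = sum(joint.values())
recovered = conditional_x1(joint, 1, 0, 1)  # should agree with p_1(1; x_0 = 0, y_0 = 1)
```

Conditioning the joint distribution on the past returns the kernel value that was used to build it, which is the finite-alphabet counterpart of (II.6).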

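The following remark summarizes the previous discussion on the two equivalent definitions
of forward and feedback channels using the definition of stochastic kernels (regular conditional
distributions and stochastic kernels are equivalent notions, see Appendix VI-A, Definition VI.13).

Remark II.1.

Suppose $\{\mathcal{X}_n : n \in \mathbb{N}\}$, $\{\mathcal{Y}_n : n \in \mathbb{N}\}$ are complete separable metric spaces (Polish spaces)
and $\{\mathcal{B}(\mathcal{X}_n) : n \in \mathbb{N}\}$, $\{\mathcal{B}(\mathcal{Y}_n) : n \in \mathbb{N}\}$ are, respectively, the $\sigma$-algebras of Borel sets.
Then

1) The collection of stochastic kernels² $\{p_n(\cdot;\cdot,\cdot) \in \mathcal{Q}(\mathcal{X}_n; \mathcal{X}_{0,n-1} \times \mathcal{Y}_{0,n-1}) : n \in \mathbb{N}\}$ uniquely
defines a family of probability measures on $(\mathcal{X}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}))$ via (II.1).

2) For any family of probability measures $P(\cdot|y)$ on $(\mathcal{X}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}))$ satisfying consistency condition
C1 there exists a collection of stochastic kernels $\{p_n(\cdot;\cdot,\cdot) \in \mathcal{Q}(\mathcal{X}_n; \mathcal{X}_{0,n-1} \times \mathcal{Y}_{0,n-1}) : n \in \mathbb{N}\}$
connected to $P(\cdot|\cdot)$ via (II.1).

3) The collection of stochastic kernels $\{q_n(\cdot;\cdot,\cdot) \in \mathcal{Q}(\mathcal{Y}_n; \mathcal{Y}_{0,n-1} \times \mathcal{X}_{0,n}) : n \in \mathbb{N}\}$ uniquely defines
a family of probability measures on $(\mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{Y}^{\mathbb{N}}))$ via (II.4).

4) For any family of probability measures $Q(\cdot|x)$ on $(\mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{Y}^{\mathbb{N}}))$ satisfying consistency condition
C2 there exists a collection of stochastic kernels $\{q_n(\cdot;\cdot,\cdot) \in \mathcal{Q}(\mathcal{Y}_n; \mathcal{Y}_{0,n-1} \times \mathcal{X}_{0,n}) : n \in \mathbb{N}\}$
connected to $Q(\cdot|\cdot)$ via (II.4).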

The point to be made here is that directed information as defined by (I.1)-(I.3) can be expressed
via the equivalent definitions of Remark II.1, 2) and 4) rather than 1) and 3). Via this equivalent
definition of directed information, the functional and topological properties of mutual information
$I(X^n; Y^n)$ can be extended to directed information on general abstract spaces. Throughout the
rest of the paper it is assumed that the conditions of Remark II.1 are satisfied. Indeed, according
to Remark II.1, all that is required is Polish spaces, which is assumed throughout the rest of
the paper. However, if the spaces $\{(\mathcal{X}_n, \mathcal{Y}_n) : n \in \mathbb{N}\}$ are not complete, countable additivity
of the family of probability measures $P(\cdot|y)$ and $Q(\cdot|x)$ does not fail. This is because these
spaces can be homeomorphically embedded as Borel subsets in complete separable metric spaces
$\{(\mathcal{X}_n, \mathcal{Y}_n) : n \in \mathbb{N}\}$, so that Kolmogorov's extension theorem can be utilized [22].

² $\mathcal{Q}(\mathcal{X}; \mathcal{Y} \times \mathcal{Z})$ denotes the set of stochastic kernels on $\mathcal{X}$ given $\mathcal{Y} \times \mathcal{Z}$.
III. Properties of Directed Information

In this section, we define the feedforward information $I(X^n \to Y^n)$ and the feedback
information $I(X^n \leftarrow Y^n)$ on abstract spaces (Polish spaces), via the Kullback-Leibler distance
(or relative entropy), using the basic family of measures $P(\cdot|y)$ on $(\mathcal{X}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}))$, and $Q(\cdot|x)$ on
$(\mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{Y}^{\mathbb{N}}))$, which satisfy consistency condition C1 and C2, respectively, and Radon-Nikodym
Derivatives (RNDs). Once this is established, then following Pinsker [29] it will become obvious
that directed information permits a representation as a supremum of relative entropy between
two distributions, where the supremum is taken over all measurable partitions on a given
$\sigma$-algebra of subsets of a set $\mathcal{Z}$. Further, in a subsequent subsection, we use the definition of
directed information in terms of $P(\cdot|y)$ and $Q(\cdot|x)$ to derive several of its properties, such as
convexity, concavity, and lower semicontinuity, with respect to these two families of measures.
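To present the precise expression for the directed information, following Csiszár [20], we first
introduce the measures of interest constructed from the basic consistent families of conditional
distributions. Let $\mathcal{M}_1(S)$ denote the set of probability measures on the measurable space
$(S, \mathcal{B}(S))$. Introduce the following notation.

$$\mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}}; \mathcal{Y}^{\mathbb{N}}) = \big\{ P(\cdot|y) \in \mathcal{M}_1(\mathcal{X}^{\mathbb{N}}) : P(\cdot|y) \text{ are regular probability measures satisfying consistency condition C1} \big\}$$

$$\mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}}; \mathcal{X}^{\mathbb{N}}) = \big\{ Q(\cdot|x) \in \mathcal{M}_1(\mathcal{Y}^{\mathbb{N}}) : Q(\cdot|x) \text{ are regular probability measures satisfying consistency condition C2} \big\}.$$

The projection of $\mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}}; \mathcal{Y}^{\mathbb{N}})$ and $\mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}}; \mathcal{X}^{\mathbb{N}})$ to a finite number of coordinates is denoted
by $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ and $\mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$, respectively.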

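Next, we define the distributions of interest. Given any conditional distributions $P(\cdot|\cdot) \in \mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}}; \mathcal{Y}^{\mathbb{N}})$
and $Q(\cdot|\cdot) \in \mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}}; \mathcal{X}^{\mathbb{N}})$, utilizing the construction of Section II, we can define uniquely
$\{p_n(\cdot;\cdot,\cdot) : n \in \mathbb{N}\}$ and $\{q_n(\cdot;\cdot,\cdot) : n \in \mathbb{N}\}$ (see (II.6), (II.7)) and the following distributions.

P1: The joint distribution on $\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}$ of the basic sequence $\{(X_n, Y_n) : n \in \mathbb{N}\}$ constructed
from $P(\cdot|y) \in \mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}}; \mathcal{Y}^{\mathbb{N}})$ and $Q(\cdot|x) \in \mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}}; \mathcal{X}^{\mathbb{N}})$, defined uniquely for $A_i \in \mathcal{B}(\mathcal{X}_i)$,
$B_i \in \mathcal{B}(\mathcal{Y}_i)$, $i \in \mathbb{N}^n$, by

$$(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})\big(\times_{i=0}^{n} (A_i \times B_i)\big) = \mathbb{P}\{X_0 \in A_0, Y_0 \in B_0, \ldots, X_n \in A_n, Y_n \in B_n\}$$

$$= \int_{A_0} p_0(dx_0) \int_{B_0} q_0(dy_0; x_0) \ldots \int_{A_n} p_n(dx_n; x^{n-1}, y^{n-1}) \int_{B_n} q_n(dy_n; y^{n-1}, x^n). \tag{III.1}$$

Formally, (III.1) is written as $(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n)$ or $\overleftarrow{P}_{0,n}(dx^n|y^{n-1}) \otimes \overrightarrow{Q}_{0,n}(dy^n|x^n)$.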

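P2: The marginal distributions on $\mathcal{X}^{\mathbb{N}}$ of the sequence $\{X_n : n \in \mathbb{N}\}$ constructed from $\mathbf{P}(\cdot|\mathbf{y}) \in \mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}}; \mathcal{Y}^{\mathbb{N}})$ and $\mathbf{Q}(\cdot|\mathbf{x}) \in \mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}}; \mathcal{X}^{\mathbb{N}})$, defined uniquely by (see the footnote on $\mu$)

$$\mu_{0,n}\big(\times_{i=0}^{n} A_i\big) = \mathbb{P}\{X_0 \in A_0, Y_0 \in \mathcal{Y}_0, \ldots, X_n \in A_n, Y_n \in \mathcal{Y}_n\}, \quad A_i \in \mathcal{B}(\mathcal{X}_i), \; i \in \mathbb{N}^n \tag{III.2}$$

$$= (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})\big(\times_{i=0}^{n} (A_i \times \mathcal{Y}_i)\big)$$

$$= \int_{A_0} p_0(dx_0) \int_{\mathcal{Y}_0} q_0(dy_0; x_0) \cdots \int_{A_n} p_n(dx_n; x^{n-1}, y^{n-1}) \int_{\mathcal{Y}_n} q_n(dy_n; y^{n-1}, x^n). \tag{III.3}$$

Formally, (III.2) is written as $\mu_{0,n}(dx^n) = (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, \mathcal{Y}_{0,n})$, and by Bayes' rule $\mu_{0,n}(dx^n) = \otimes_{i=0}^{n} \mu_{i|i-1}(dx_i; x^{i-1})$.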

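P3: The marginal distributions on $\mathcal{Y}^{\mathbb{N}}$ of the sequence $\{Y_n : n \in \mathbb{N}\}$ constructed from $\mathbf{P}(\cdot|\mathbf{y}) \in \mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}}; \mathcal{Y}^{\mathbb{N}})$ and $\mathbf{Q}(\cdot|\mathbf{x}) \in \mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}}; \mathcal{X}^{\mathbb{N}})$, defined uniquely by

$$\nu_{0,n}\big(\times_{i=0}^{n} B_i\big) = \mathbb{P}\{X_0 \in \mathcal{X}_0, Y_0 \in B_0, \ldots, X_n \in \mathcal{X}_n, Y_n \in B_n\}, \quad B_i \in \mathcal{B}(\mathcal{Y}_i), \; i \in \mathbb{N}^n \tag{III.4}$$

$$= (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})\big(\times_{i=0}^{n} (\mathcal{X}_i \times B_i)\big)$$

$$= \int_{\mathcal{X}_0} p_0(dx_0) \int_{B_0} q_0(dy_0; x_0) \cdots \int_{\mathcal{X}_n} p_n(dx_n; x^{n-1}, y^{n-1}) \int_{B_n} q_n(dy_n; y^{n-1}, x^n). \tag{III.5}$$

Formally, (III.4) is written as $\nu_{0,n}(dy^n) = (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})(\mathcal{X}_{0,n}, dy^n)$, and by Bayes' rule $\nu_{0,n}(dy^n) = \otimes_{i=0}^{n} \nu_{i|i-1}(dy_i; y^{i-1})$.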

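P4: The measure $\overrightarrow{\Pi}_{0,n} : \mathcal{B}(\mathcal{X}_{0,n}) \odot \mathcal{B}(\mathcal{Y}_{0,n}) \mapsto [0,1]$ constructed from $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ and $\nu_{0,n}(dy^n) = (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})(\mathcal{X}_{0,n}, dy^n) \in \mathcal{M}_1(\mathcal{Y}_{0,n})$ of (III.4), defined uniquely by

$$\overrightarrow{\Pi}_{0,n}\big(\times_{i=0}^{n} (A_i \times B_i)\big) = \int_{A_0} p_0(dx_0) \int_{B_0} \nu_0(dy_0) \int_{A_1} p_1(dx_1; x_0, y_0) \int_{B_1} \nu_{1|0}(dy_1; y_0) \cdots \int_{A_n} p_n(dx_n; x^{n-1}, y^{n-1}) \int_{B_n} \nu_{n|n-1}(dy_n; y^{n-1}). \tag{III.6}$$

Formally, (III.6) is written as $\overrightarrow{\Pi}_{0,n}(dx^n, dy^n) = \overleftarrow{P}_{0,n}(dx^n|y^{n-1}) \otimes \nu_{0,n}(dy^n)$.

Footnote (to P2): Actually $\mu = \mu^{P \otimes Q}$, but we omit the superscript throughout the paper.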

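P5: The measure $\overleftarrow{\Pi}_{0,n} : \mathcal{B}(\mathcal{Y}_{0,n}) \odot \mathcal{B}(\mathcal{X}_{0,n}) \mapsto [0,1]$ constructed from $\overrightarrow{Q}_{0,n}(dy^n|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ and $\mu_{0,n}(dx^n) = (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, \mathcal{Y}_{0,n}) \in \mathcal{M}_1(\mathcal{X}_{0,n})$ of (III.3), defined uniquely by

$$\overleftarrow{\Pi}_{0,n}\big(\times_{i=0}^{n} (A_i \times B_i)\big) = \int_{A_0} \mu_0(dx_0) \int_{B_0} q_0(dy_0; x_0) \int_{A_1} \mu_{1|0}(dx_1; x_0) \int_{B_1} q_1(dy_1; y_0, x^1) \cdots \int_{A_n} \mu_{n|n-1}(dx_n; x^{n-1}) \int_{B_n} q_n(dy_n; y^{n-1}, x^n). \tag{III.7}$$

Formally, (III.7) is written as $\overleftarrow{\Pi}_{0,n}(dx^n, dy^n) = \mu_{0,n}(dx^n) \otimes \overrightarrow{Q}_{0,n}(dy^n|x^n)$.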

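From the above definitions, an alternative way to construct the conditional distributions $\nu_{n|n-1}(\cdot;\cdot) \in \mathcal{Q}(\mathcal{Y}_n; \mathcal{Y}_{0,n-1})$ and $\mu_{n|n-1}(\cdot;\cdot) \in \mathcal{Q}(\mathcal{X}_n; \mathcal{X}_{0,n-1})$ is as follows. Let $A^{(n)} = \{\mathbf{x} : x_n \in A\}$, $A \in \mathcal{B}(\mathcal{X}_n)$, $B^{(n)} = \{\mathbf{y} : y_n \in B\}$, $B \in \mathcal{B}(\mathcal{Y}_n)$, and let $\overrightarrow{\Pi}_{0,n}(A^{(n)}, B^{(n)} | \mathcal{B}(\mathcal{X}_{0,n-1}) \odot \mathcal{B}(\mathcal{Y}_{0,n-1}))$ denote the joint conditional probability of $A^{(n)} \times B^{(n)}$ with respect to $\mathcal{B}(\mathcal{X}_{0,n-1}) \odot \mathcal{B}(\mathcal{Y}_{0,n-1})$ calculated on the probability space $(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \overrightarrow{\Pi}_{0,n}(\cdot))$. Then for $A \in \mathcal{B}(\mathcal{X}_n)$, $B \in \mathcal{B}(\mathcal{Y}_n)$ we obtain

$$\overrightarrow{\Pi}_{0,n}\big(A^{(n)}, B^{(n)} | \mathcal{B}(\mathcal{X}_{0,n-1}) \odot \mathcal{B}(\mathcal{Y}_{0,n-1})\big) = p_n(A; x^{n-1}, y^{n-1}) \times \nu_{n|n-1}(B; y^{n-1}). \tag{III.8}$$

Hence, $\nu_{n|n-1}(\cdot;\cdot) \in \mathcal{Q}(\mathcal{Y}_n; \mathcal{Y}_{0,n-1})$ is given by $\nu_{n|n-1}(dy_n; y^{n-1}) = \int_{\mathcal{X}_n} \overrightarrow{\Pi}_{0,n}(dx_n, dy_n | x^{n-1}, y^{n-1})$, from which $\nu_{0,n}(dy^n) \in \mathcal{M}_1(\mathcal{Y}_{0,n})$ is also obtained. Similarly, let $\overleftarrow{\Pi}_{0,n}(A^{(n)}, B^{(n)} | \mathcal{B}(\mathcal{Y}_{0,n-1}) \odot \mathcal{B}(\mathcal{X}_{0,n-1}))$ denote the joint conditional probability of $A^{(n)} \times B^{(n)}$ with respect to $\mathcal{B}(\mathcal{Y}_{0,n-1}) \odot \mathcal{B}(\mathcal{X}_{0,n-1})$ calculated on the probability space $(\mathcal{Y}^{\mathbb{N}} \times \mathcal{X}^{\mathbb{N}}, \mathcal{B}(\mathcal{Y}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{X}^{\mathbb{N}}), \overleftarrow{\Pi}_{0,n}(\cdot))$. Then for $A \in \mathcal{B}(\mathcal{X}_n)$, $B \in \mathcal{B}(\mathcal{Y}_n)$,

$$\overleftarrow{\Pi}_{0,n}\big(A^{(n)}, B^{(n)} | \mathcal{B}(\mathcal{X}_{0,n-1}) \odot \mathcal{B}(\mathcal{Y}_{0,n-1})\big) = \int_{A} q_n(B; y^{n-1}, x^n) \otimes \mu_{n|n-1}(dx_n; x^{n-1}), \tag{III.9}$$

from which $\mu_{n|n-1}(\cdot;\cdot) \in \mathcal{Q}(\mathcal{X}_n; \mathcal{X}_{0,n-1})$ and $\mu_{0,n}(dx^n) \in \mathcal{M}_1(\mathcal{X}_{0,n})$ are obtained. Similarly, from (III.6) and (III.8) we can obtain any of the individual kernels $p_n(\cdot;\cdot,\cdot)$ and $q_n(\cdot;\cdot,\cdot)$ appearing in their right hand side by proper conditional expectations.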

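Using the first definition of basic processes, that is, given a collection of stochastic kernels $\{p_n(\cdot;\cdot,\cdot) \in \mathcal{Q}(\mathcal{X}_n; \mathcal{X}_{0,n-1} \times \mathcal{Y}_{0,n-1}) : n \in \mathbb{N}\}$ and $\{q_n(\cdot;\cdot,\cdot) \in \mathcal{Q}(\mathcal{Y}_n; \mathcal{Y}_{0,n-1} \times \mathcal{X}_{0,n}) : n \in \mathbb{N}\}$, the joint distribution, as well as the conditional distributions, are defined via P1-P5. Consequently, it is well known that directed information is defined via relative entropy as follows [8] (see Appendix VI-A, pp. 47-48).

$$I(X^n \rightarrow Y^n) \triangleq \sum_{i=0}^{n} I(X^i; Y_i | Y^{i-1})$$

$$= \sum_{i=0}^{n} \int_{\mathcal{Y}_{0,i-1}} \int_{\mathcal{X}_{0,i} \times \mathcal{Y}_i} \log\left( \frac{P_{0,i}(dx^i, dy_i; y^{i-1})}{P_{0,i}(dx^i; y^{i-1}) \times \nu_{i|i-1}(dy_i; y^{i-1})} \right) P_{0,i}(dx^i, dy_i; y^{i-1}) \, P_{0,i-1}(dy^{i-1}) \tag{III.10}$$

$$= \sum_{i=0}^{n} \int D\big(q_i(\cdot; y^{i-1}, x^i) \,\|\, \nu_{i|i-1}(\cdot; y^{i-1})\big) \, p_i(dx_i; x^{i-1}, y^{i-1}) \otimes_{j=0}^{i-1} \big( q_j(dy_j; y^{j-1}, x^j) \otimes p_j(dx_j; x^{j-1}, y^{j-1}) \big) \tag{III.11}$$

$$\equiv \mathbb{I}_{X^n \rightarrow Y^n}\big(p_i(\cdot;\cdot,\cdot), q_i(\cdot;\cdot,\cdot) : i = 0, 1, \ldots, n\big). \tag{III.12}$$

The right hand side in (III.10) follows from the definition of conditional mutual information, while we use the notation $\mathbb{I}_{X^n \rightarrow Y^n}(p_i(\cdot;\cdot,\cdot), q_i(\cdot;\cdot,\cdot) : i = 0, 1, \ldots, n)$ to denote that $I(X^n \rightarrow Y^n)$ is a functional of $\{p_i(\cdot;\cdot,\cdot), q_i(\cdot;\cdot,\cdot) : i = 0, 1, \ldots, n\}$. Additional comments regarding (III.10)-(III.12) are given in Appendix VII-A (pp. 48-50).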



A. Directed Information Functional of Consistent Conditional Distributions 

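Now we consider the second definition of basic process introduced in Section II. Given any family of measures $\mathbf{P}(\cdot|\mathbf{y}) \in \mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}}; \mathcal{Y}^{\mathbb{N}})$ and $\mathbf{Q}(\cdot|\mathbf{x}) \in \mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}}; \mathcal{X}^{\mathbb{N}})$, the measures under P1-P5 are constructed. Next, we define directed information via relative entropy as done for mutual information [ ]. By Lemma VI.14, $\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n} \ll \overleftarrow{P}_{0,n} \otimes \nu_{0,n}$ if and only if $\overrightarrow{Q}_{0,n}(\cdot|x^n) \ll \nu_{0,n}(\cdot)$ for $\overleftarrow{P}_{0,n}$-almost all $x^n \in \mathcal{X}_{0,n}$. Utilizing the Radon-Nikodym derivative (RND) $\frac{d(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})}{d(\overleftarrow{P}_{0,n} \otimes \nu_{0,n})}(x^n, y^n)$, define the relative entropy of $\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n}$ with respect to $\overrightarrow{\Pi}_{0,n}$ as follows.

$$I(X^n \rightarrow Y^n) \triangleq D(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n} \,\|\, \overrightarrow{\Pi}_{0,n}) = \int_{\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}} \log\left( \frac{d(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})}{d(\overleftarrow{P}_{0,n} \otimes \nu_{0,n})}(x^n, y^n) \right) (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n) \tag{III.13}$$

$$= \int_{\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}} \log\left( \frac{d\overrightarrow{Q}_{0,n}(\cdot|x^n)}{d\nu_{0,n}(\cdot)}(y^n) \right) (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n). \tag{III.14}$$

Note that (III.14) is obtained by utilizing the fact that if $\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n} \ll \overleftarrow{P}_{0,n} \otimes \nu_{0,n}$ then the RND $\frac{d(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})}{d(\overleftarrow{P}_{0,n} \otimes \nu_{0,n})}(x^n, y^n)$ represents a version of $\frac{d\overrightarrow{Q}_{0,n}(\cdot|x^n)}{d\nu_{0,n}(\cdot)}(y^n)$, $\overleftarrow{P}_{0,n}$-a.s. for all $x^n \in \mathcal{X}_{0,n}$.

On the other hand, using Lemma VI.14, $\overrightarrow{Q}_{0,n}(\cdot|x^n) \ll \nu_{0,n}(\cdot)$ for $\overleftarrow{P}_{0,n}$-almost all $x^n \in \mathcal{X}_{0,n}$, and by the Radon-Nikodym theorem there exists a version of the RND $\frac{d\overrightarrow{Q}_{0,n}(\cdot|x^n)}{d\nu_{0,n}(\cdot)}(y^n)$ which is a non-negative measurable function of $(x^n, y^n) \in \mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}$. Hence another version of this RND is $\frac{d(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})}{d(\overleftarrow{P}_{0,n} \otimes \nu_{0,n})}(x^n, y^n)$. We use the notation $\mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n})$ to illustrate that $D(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n} \| \overrightarrow{\Pi}_{0,n})$ is a functional of $\{\overleftarrow{P}_{0,n}(\cdot|y^{n-1}), \overrightarrow{Q}_{0,n}(\cdot|x^n)\} \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1}) \times \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$.

In the next Remark we summarize the equivalent definitions of directed information based on the two equivalent definitions of channels, that is, the one based on (III.11), (III.12), and the one based on (III.13), (III.14).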

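Remark III.1.

Let $\mathbf{P}(\cdot|\cdot) \in \mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}}; \mathcal{Y}^{\mathbb{N}})$ and $\mathbf{Q}(\cdot|\cdot) \in \mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}}; \mathcal{X}^{\mathbb{N}})$. By repeated application of Lemma VI.14, and the chain rule of relative entropy, Theorem VI.19 of Appendix VI-A, directed information admits the following equivalent definitions.

$$I(X^n \rightarrow Y^n) \triangleq \sum_{i=0}^{n} I(X^i; Y_i | Y^{i-1}) \tag{III.15}$$

$$= \sum_{i=0}^{n} \int D\big(q_i(\cdot; y^{i-1}, x^i) \,\|\, \nu_{i|i-1}(\cdot; y^{i-1})\big) \, p_i(dx_i; x^{i-1}, y^{i-1}) \otimes_{j=0}^{i-1} \big( q_j(dy_j; y^{j-1}, x^j) \otimes p_j(dx_j; x^{j-1}, y^{j-1}) \big) \tag{III.16}$$

$$= D(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n} \,\|\, \overrightarrow{\Pi}_{0,n}) \tag{III.17}$$

$$= \int \log\left( \frac{d\overrightarrow{Q}_{0,n}(\cdot|x^n)}{d\nu_{0,n}(\cdot)}(y^n) \right) (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n) \tag{III.18}$$

$$\equiv \mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n}). \tag{III.19}$$

Clearly, (III.19) is valid even when $(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n)$ is singular with respect to $(\overleftarrow{P}_{0,n} \otimes \nu_{0,n})(dx^n, dy^n)$, in which case its value is $+\infty$. The point to be made here is that we will show the convexity, concavity, and lower semicontinuity properties of directed information $I(X^n \rightarrow Y^n)$, using the functional $\mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n})$, with respect to the families of measures $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ and $\overrightarrow{Q}_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$. We will also use the directed information definition $D(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n} \| \overrightarrow{\Pi}_{0,n})$, as a functional of $\{\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n}\}$, to show lower semicontinuity, convexity and concavity properties. These functional and topological properties are important in showing existence of solutions to extremum problems, such as maximizing directed information over channel input distributions $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ satisfying power constraints, for capacity calculations, and minimizing directed information over $\overrightarrow{Q}_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ satisfying fidelity constraints, for sequential and nonanticipative rate distortion functions.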
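
The equivalence of the chain-rule form (III.15) and the relative-entropy form (III.18) can be checked numerically on finite alphabets. The following is a minimal sketch, not part of the paper: it assumes two time steps ($n = 1$), finite alphabets, and randomly drawn strictly positive kernels; the helper names (`random_pmf`, `build_model`, `marginal`, `di_chain_rule`, `di_relative_entropy`) are hypothetical and introduced only for illustration.

```python
import itertools
import math
import random


def random_pmf(size, rng):
    """Return a strictly positive probability vector of the given length."""
    weights = [rng.random() + 0.1 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def build_model(nx, ny, seed=0):
    """Draw kernels p0, p1, q0, q1 at random (an assumption of this sketch).

    joint[(x0, x1, y0, y1)] = p0(x0) q0(y0;x0) p1(x1;x0,y0) q1(y1;y0,x0,x1)
    Q[(x0, x1, y0, y1)]     = q0(y0;x0) q1(y1;y0,x0,x1)
    """
    rng = random.Random(seed)
    p0 = random_pmf(nx, rng)
    q0 = [random_pmf(ny, rng) for _ in range(nx)]  # q0[x0][y0]
    p1 = {(x0, y0): random_pmf(nx, rng) for x0 in range(nx) for y0 in range(ny)}
    q1 = {(y0, x0, x1): random_pmf(ny, rng)
          for y0 in range(ny) for x0 in range(nx) for x1 in range(nx)}
    joint, Q = {}, {}
    for x0, x1 in itertools.product(range(nx), repeat=2):
        for y0, y1 in itertools.product(range(ny), repeat=2):
            q = q0[x0][y0] * q1[(y0, x0, x1)][y1]
            Q[(x0, x1, y0, y1)] = q
            joint[(x0, x1, y0, y1)] = p0[x0] * p1[(x0, y0)][x1] * q
    return joint, Q


def marginal(joint, keep):
    """Sum the joint pmf over all coordinates whose index is not in `keep`."""
    out = {}
    for key, prob in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + prob
    return out


def di_chain_rule(joint):
    """I(X^0;Y_0) + I(X^1;Y_1|Y_0): the n = 1 case of (III.15), in nats."""
    p_x0y0 = marginal(joint, (0, 2))
    p_x0 = marginal(joint, (0,))
    p_y0 = marginal(joint, (2,))
    p_xxy0 = marginal(joint, (0, 1, 2))
    p_yy = marginal(joint, (2, 3))
    first = sum(p * math.log(p / (p_x0[(x0,)] * p_y0[(y0,)]))
                for (x0, y0), p in p_x0y0.items())
    second = sum(p * math.log(p * p_y0[(y0,)] / (p_xxy0[(x0, x1, y0)] * p_yy[(y0, y1)]))
                 for (x0, x1, y0, y1), p in joint.items())
    return first + second


def di_relative_entropy(joint, Q):
    """Sum of joint * log(Q / nu): the n = 1 case of (III.18), in nats."""
    nu = marginal(joint, (2, 3))
    return sum(p * math.log(Q[k] / nu[(k[2], k[3])]) for k, p in joint.items())


joint, Q = build_model(nx=2, ny=3, seed=1)
print(di_chain_rule(joint), di_relative_entropy(joint, Q))
```

The two printed values should agree up to floating-point rounding, because $q_0$ and $q_1$ are the conditional laws of $Y_0$ given $X_0$ and of $Y_1$ given $(Y_0, X^1)$ under the joint distribution of P1.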

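Remark III.2.

Note that one may investigate directed information $\mathbb{I}_{X^n \rightarrow Y^n}(p_i(\cdot;\cdot,\cdot), q_i(\cdot;\cdot,\cdot) : i = 0, 1, \ldots, n)$ as a functional on the space of vector measures $\{p_0, p_1, \ldots, p_n\} \in \times_{i=0}^{n} \mathcal{Q}(\mathcal{X}_i; \mathcal{X}_{0,i-1} \times \mathcal{Y}_{0,i-1})$ and $\{q_0, q_1, \ldots, q_n\} \in \times_{i=0}^{n} \mathcal{Q}(\mathcal{Y}_i; \mathcal{Y}_{0,i-1} \times \mathcal{X}_{0,i})$. However, it is not clear to us whether this formulation is appropriate to obtain the results R1)-R7).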

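Feedback Information.

Similar to the above discussion we can also define feedback information as follows. It can be easily verified that $\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n-1} \ll \overrightarrow{Q}_{0,n-1} \otimes \mu_{0,n}$ if and only if $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \ll \mu_{0,n}(\cdot)$ for $\nu_{0,n-1}$-almost all $y^{n-1} \in \mathcal{Y}_{0,n-1}$. Moreover, if $\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n-1} \ll \overrightarrow{Q}_{0,n-1} \otimes \mu_{0,n}$ then the RND $\frac{d(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n-1})}{d(\overrightarrow{Q}_{0,n-1} \otimes \mu_{0,n})}(x^n, y^{n-1})$ represents a version of $\frac{d\overleftarrow{P}_{0,n}(\cdot|y^{n-1})}{d\mu_{0,n}(\cdot)}(x^n)$, $\nu_{0,n-1}$-a.s. for all $y^{n-1} \in \mathcal{Y}_{0,n-1}$. Hence, when $\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n-1} \ll \overrightarrow{Q}_{0,n-1} \otimes \mu_{0,n}$, then

$$I(X^n \leftarrow Y^n) \triangleq \sum_{i=0}^{n} I(X_i; Y^{i-1} | X^{i-1})$$

$$= \sum_{i=0}^{n} \int_{\mathcal{X}_{0,i-1} \times \mathcal{Y}_{0,i-1}} D\big(p_i(\cdot; x^{i-1}, y^{i-1}) \,\|\, \mu_{i|i-1}(\cdot; x^{i-1})\big) \, (\overleftarrow{P}_{0,i-1} \otimes \overrightarrow{Q}_{0,i-1})(dx^{i-1}, dy^{i-1})$$

$$= D(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n-1} \,\|\, \overleftarrow{\Pi}_{0,n}) = \int \log\left( \frac{d(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n-1})}{d(\overrightarrow{Q}_{0,n-1} \otimes \mu_{0,n})}(x^n, y^{n-1}) \right) (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n-1})(dx^n, dy^{n-1}) \tag{III.20}$$

$$= \int \log\left( \frac{d\overleftarrow{P}_{0,n}(\cdot|y^{n-1})}{d\mu_{0,n}(\cdot)}(x^n) \right) (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n-1})(dx^n, dy^{n-1}) \tag{III.21}$$

$$\equiv \mathbb{I}_{X^n \leftarrow Y^n}(\overrightarrow{Q}_{0,n-1}, \overleftarrow{P}_{0,n}). \tag{III.22}$$

Note that (III.22) states that directed information $I(X^n \leftarrow Y^n)$ is expressed as a functional of $\{\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n-1}\}$ and it is denoted by $\mathbb{I}_{X^n \leftarrow Y^n}(\overrightarrow{Q}_{0,n-1}, \overleftarrow{P}_{0,n})$. We will not pursue the properties of (III.22) since these follow along the same line of reasoning as those of $I(X^n \rightarrow Y^n)$.
We also point out that directed information can be equivalently defined as a supremum over appropriate partitions of measurable spaces with respect to the families $\mathbf{P}(\cdot|\cdot) \in \mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}}; \mathcal{Y}^{\mathbb{N}})$ and $\mathbf{Q}(\cdot|\cdot) \in \mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}}; \mathcal{X}^{\mathbb{N}})$. This construction is elaborated in Appendix VII-A.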



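We are now ready to investigate the functional properties of directed information, as a functional of the space of input conditional distributions $\mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}}; \mathcal{Y}^{\mathbb{N}})$ and the space of channel conditional distributions $\mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}}; \mathcal{X}^{\mathbb{N}})$, using the relative entropy definition of directed information, $I(X^n \rightarrow Y^n) \equiv \mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n})$, as a function of $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ and $\overrightarrow{Q}_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$. Recall that for mutual information defined via relative entropy, analogous properties, such as convexity, concavity, lower semicontinuity, etc., are well known [20], [27], [30]. We shall derive similar properties for the functional $\mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n})$.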



B. Convexity and Concavity of Directed Information 

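First, we show that the families of regular conditional distributions $\mathbf{P}(\cdot|\mathbf{y})$ and $\mathbf{Q}(\cdot|\mathbf{x})$ satisfying consistency conditions C1 and C2 are convex, and then we show convexity of directed information with respect to $\mathbf{Q}(\cdot|\mathbf{x})$ and concavity with respect to $\mathbf{P}(\cdot|\mathbf{y})$. We recall the definition of convexity of the set of regular conditional measures given in Definition VI.17 (see Appendix VI-A, p. 47). Let $\mathcal{M}_1(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P})$ denote the set of probability measures on $(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}))$ which are absolutely continuous with respect to a fixed reference probability measure $\bar{P}$, and have restrictions on $\mathcal{B}(\mathcal{Y}^{\mathbb{N}})$ which are equivalent (mutually absolutely continuous) with respect to $\bar{P}|_{\mathcal{B}(\mathcal{Y}^{\mathbb{N}})}$. Hence, for any $P \in \mathcal{M}_1(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P})$, by definition, $P \ll \bar{P}$, $P|_{\mathcal{B}(\mathcal{Y}^{\mathbb{N}})} \ll \bar{P}|_{\mathcal{B}(\mathcal{Y}^{\mathbb{N}})}$ and $\bar{P}|_{\mathcal{B}(\mathcal{Y}^{\mathbb{N}})} \ll P|_{\mathcal{B}(\mathcal{Y}^{\mathbb{N}})}$, denoted by $P|_{\mathcal{B}(\mathcal{Y}^{\mathbb{N}})} \sim \bar{P}|_{\mathcal{B}(\mathcal{Y}^{\mathbb{N}})}$.
Introduce the set of all regular conditional probability measures

$$\mathcal{M}_1\big(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P}|\mathcal{B}(\mathcal{Y}^{\mathbb{N}})\big)(\mathbf{y}) \triangleq \big\{ \mathbf{P}(\cdot|\mathbf{y}) : P \in \mathcal{M}_1(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P}) \big\}. \tag{III.23}$$

Define the subset of the above set consisting of all conditional distributions which satisfy consistency condition C1 as follows.

$$\mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}}; \mathcal{Y}^{\mathbb{N}}) \triangleq \big\{ \mathbf{P}(\cdot|\mathbf{y}) \in \mathcal{M}_1\big(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P}|\mathcal{B}(\mathcal{Y}^{\mathbb{N}})\big)(\mathbf{y}) : \mathbf{P}(\cdot|\mathbf{y}) \text{ satisfies consistency condition C1} \big\}.$$

The projection to finite coordinates is defined by

$$\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1}) = \big\{ \mathbf{P}(\cdot|\mathbf{y}) \in \mathcal{M}_1\big(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P}|\mathcal{B}(\mathcal{Y}_{0,n-1})\big)(y^{n-1}) : \mathbf{P}(\cdot|\mathbf{y}) \text{ satisfies consistency condition C1} \big\}. \tag{III.24}$$

Similarly, define the set of measures $\mathbf{Q}(\cdot|\mathbf{x})$ which satisfy consistency condition C2 by

$$\mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}}; \mathcal{X}^{\mathbb{N}}) \triangleq \big\{ \mathbf{Q}(\cdot|\mathbf{x}) \in \mathcal{M}_1\big(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P}|\mathcal{B}(\mathcal{X}^{\mathbb{N}})\big)(\mathbf{x}) : \mathbf{Q}(\cdot|\mathbf{x}) \text{ satisfies consistency condition C2} \big\} \tag{III.25}$$

and its projection to finite coordinates by

$$\mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n}) = \big\{ \mathbf{Q}(\cdot|\mathbf{x}) \in \mathcal{M}_1\big(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P}|\mathcal{B}(\mathcal{X}_{0,n})\big)(x^n) : \mathbf{Q}(\cdot|\mathbf{x}) \text{ satisfies consistency condition C2} \big\}. \tag{III.26}$$

Next, we show convexity of the sets $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ and $\mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$.

Theorem III.3.

Let $\{\mathcal{X}_n : n \in \mathbb{N}\}$, $\{\mathcal{Y}_n : n \in \mathbb{N}\}$ be Polish spaces with $\mathcal{B}(\mathcal{X}_n)$, $\mathcal{B}(\mathcal{Y}_n)$, respectively, the $\sigma$-algebras of Borel sets. Then the sets $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ and $\mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ are convex subsets of the sets of regular conditional distributions $\mathcal{M}_1(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P}|\mathcal{B}(\mathcal{Y}_{0,n-1}))(y^{n-1})$ and $\mathcal{M}_1(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P}|\mathcal{B}(\mathcal{X}_{0,n}))(x^n)$, respectively.

Proof. The a.s.-convexity of the sets $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ and $\mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ will be shown using Definition VI.17. Since the methodology is similar for both sets, only the derivation for $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ will be given. It is well known that the sets $\mathcal{M}_1(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P}|\mathcal{B}(\mathcal{Y}_{0,n-1}))(y^{n-1})$ and $\mathcal{M}_1(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P}|\mathcal{B}(\mathcal{X}_{0,n}))(x^n)$ are convex. By the definition of a.s.-convexity of the set of regular conditional distributions (Definition VI.17), $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ is convex (almost surely) if for given $\mathbf{P}^1(\cdot|\mathbf{y}), \mathbf{P}^2(\cdot|\mathbf{y}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ and a given $\lambda \in (0,1)$, there exists a probability measure $P$ on $(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}))$ whose regular conditional measure $\mathbf{P}(\cdot|\mathbf{y}) \in \mathcal{M}_1(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P}|\mathcal{B}(\mathcal{Y}_{0,n-1}))(y^{n-1})$ is a convex combination $\mathbf{P}(\cdot|\mathbf{y}) = \lambda \mathbf{P}^1(\cdot|\mathbf{y}) + (1-\lambda) \mathbf{P}^2(\cdot|\mathbf{y})$, $\bar{P}|_{\mathcal{B}(\mathcal{Y}_{0,n-1})}$-a.s., and consistency condition C1 holds. This property is denoted by $\lambda \mathbf{P}^1(\cdot|\mathbf{y}) + (1-\lambda) \mathbf{P}^2(\cdot|\mathbf{y}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$. By [31] the set of regular conditional distributions $\mathcal{M}_1(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P}|\mathcal{B}(\mathcal{Y}_{0,n-1}))(y^{n-1})$ is a convex set. Also, $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1}) \subseteq \mathcal{M}_1(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P}|\mathcal{B}(\mathcal{Y}_{0,n-1}))(y^{n-1})$, and hence if $\mathbf{P}^1(\cdot|\mathbf{y}), \mathbf{P}^2(\cdot|\mathbf{y}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ then they belong to this larger set and satisfy the property

$$\lambda \mathbf{P}^1(\cdot|\mathbf{y}) + (1-\lambda) \mathbf{P}^2(\cdot|\mathbf{y}) \in \mathcal{M}_1\big(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P}|\mathcal{B}(\mathcal{Y}_{0,n-1})\big)(y^{n-1}), \quad \forall \lambda \in (0,1).$$

To complete the derivation it is sufficient to show that the consistency condition C1 holds. Let $A^{(n)} = \{\mathbf{x} : x_n \in A\}$, $A \in \mathcal{B}(\mathcal{X}_n)$. In addition, let $\mathbf{P}(A^{(n)}|\mathbf{y}|\mathcal{B}(\mathcal{X}_{0,n-1}))$ denote the conditional probability of $A^{(n)}$ with respect to $\mathcal{B}(\mathcal{X}_{0,n-1})$ calculated on the probability space $(\mathcal{X}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}), \mathbf{P}(\cdot|\mathbf{y}))$. From the definition of regular conditional probability measures on $\mathcal{M}_1(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \bar{P}|\mathcal{B}(\mathcal{Y}_{0,n-1}))$ it follows that

$$\mathbf{P}(A^{(n)}|\mathbf{y}|\mathcal{B}(\mathcal{X}_{0,n-1})) = \lambda \mathbf{P}^1(A^{(n)}|\mathbf{y}|\mathcal{B}(\mathcal{X}_{0,n-1})) + (1-\lambda) \mathbf{P}^2(A^{(n)}|\mathbf{y}|\mathcal{B}(\mathcal{X}_{0,n-1})) \quad \text{a.s.}$$

$$= \lambda p_n^1(A; x^{n-1}, y^{n-1}) + (1-\lambda) p_n^2(A; x^{n-1}, y^{n-1}) \quad \text{a.s.}$$

where $p_n^1(\cdot;\cdot,\cdot)$ and $p_n^2(\cdot;\cdot,\cdot)$ are regular conditional distributions. Since a convex combination of regular conditional distributions is also a regular conditional distribution, $\lambda \mathbf{P}^1(\cdot|\mathbf{y}) + (1-\lambda) \mathbf{P}^2(\cdot|\mathbf{y}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$, i.e., the consistency condition C1 holds, and the derivation is complete. The derivation of a.s.-convexity of the set $\mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ is done similarly. □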

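Since $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ and $\mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ are convex, we proceed further to show that directed information $\mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n})$, as a functional of $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ for a fixed $\overrightarrow{Q}_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$, is concave, and as a functional of $\overrightarrow{Q}_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ for a fixed $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$, is convex. These results are shown in the next theorem.

Theorem III.4.

Let $\{\mathcal{X}_n : n \in \mathbb{N}\}$, $\{\mathcal{Y}_n : n \in \mathbb{N}\}$ be Polish spaces with $\mathcal{B}(\mathcal{X}_n)$, $\mathcal{B}(\mathcal{Y}_n)$, respectively, the $\sigma$-algebras of Borel sets. Consider the directed information functional $I(X^n \rightarrow Y^n) \equiv \mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n})$, $\mathbb{I}_{X^n \rightarrow Y^n} : \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1}) \times \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n}) \mapsto [0, \infty]$, defined by (III.19).
Then the following hold.

1) $\mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n})$ is a convex functional of $\overrightarrow{Q}_{0,n} \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ for a fixed $\overleftarrow{P}_{0,n} \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$.

2) $\mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n})$ is a concave functional of $\overleftarrow{P}_{0,n} \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ for a fixed $\overrightarrow{Q}_{0,n} \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$.

3) $\mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \cdot)$ is a strictly convex functional on the set $\{\overrightarrow{Q}_{0,n} \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n}) : \mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n}) < \infty\}$ for a fixed $\overleftarrow{P}_{0,n} \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$.

Proof. By Theorem III.3, the sets $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ and $\mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ are convex. Therefore, to show parts 1), 2), 3) we utilize the consistency of the two families of conditional distributions and we apply the log-sum formulae. The complete derivation is given in Appendix VII-B. □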

Theorem III.4 is analogous to similar results for mutual information $I(X^n; Y^n) \equiv \mathbb{I}_{X^n; Y^n}(P_{X^n}, P_{Y^n|X^n})$, expressed as a functional of the input distribution $P_{X^n}(dx^n) \in \mathcal{M}_1(\mathcal{X}_{0,n})$ and the channel $P_{Y^n|X^n}(dy^n|x^n) \in \mathcal{Q}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$, which is a convex (respectively concave) functional of $P_{Y^n|X^n}(dy^n|x^n) \in \mathcal{Q}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ (respectively $P_{X^n}(dx^n) \in \mathcal{M}_1(\mathcal{X}_{0,n})$) for a fixed $P_{X^n}(dx^n) \in \mathcal{M}_1(\mathcal{X}_{0,n})$ (respectively $P_{Y^n|X^n}(dy^n|x^n) \in \mathcal{Q}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$). It is important to point out that if one considers the alternative definition of directed information (III.10), (III.12), as a functional of the sequence of input and channel distributions, $I(X^n \rightarrow Y^n) \equiv \mathbb{I}_{X^n \rightarrow Y^n}(p_i(\cdot;\cdot,\cdot), q_i(\cdot;\cdot,\cdot) : i = 0, 1, \ldots, n)$, then it is not clear to us whether it is possible to establish convexity and concavity with respect to $q_i$ and $p_i$.
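
Parts 1) and 2) of Theorem III.4 can be probed numerically in a finite-alphabet setting. The sketch below is an illustration under stated assumptions, not the paper's derivation: it takes $n = 1$, finite alphabets, and randomly drawn kernels, and it forms convex combinations at the level of the causally conditioned products $\overleftarrow{P}_{0,1}$ and $\overrightarrow{Q}_{0,1}$ (not of the individual kernels $p_i$, $q_i$). All function names are hypothetical.

```python
import math
import random


def random_pmf(size, rng):
    """Return a strictly positive probability vector of the given length."""
    weights = [rng.random() + 0.1 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def causal_input(nx, ny, rng):
    """P(x0,x1 | y0) = p0(x0) p1(x1;x0,y0), keyed by (x0, x1, y0)."""
    p0 = random_pmf(nx, rng)
    P = {}
    for x0 in range(nx):
        for y0 in range(ny):
            p1 = random_pmf(nx, rng)
            for x1 in range(nx):
                P[(x0, x1, y0)] = p0[x0] * p1[x1]
    return P


def causal_channel(nx, ny, rng):
    """Q(y0,y1 | x0,x1) = q0(y0;x0) q1(y1;y0,x0,x1), keyed by (x0, x1, y0, y1)."""
    Q = {}
    for x0 in range(nx):
        q0 = random_pmf(ny, rng)
        for x1 in range(nx):
            for y0 in range(ny):
                q1 = random_pmf(ny, rng)
                for y1 in range(ny):
                    Q[(x0, x1, y0, y1)] = q0[y0] * q1[y1]
    return Q


def directed_information(P, Q):
    """Finite-alphabet analogue of (III.18): sum of (P x Q) log(Q / nu), in nats."""
    joint = {k: P[(k[0], k[1], k[2])] * q for k, q in Q.items()}
    nu = {}
    for (x0, x1, y0, y1), p in joint.items():
        nu[(y0, y1)] = nu.get((y0, y1), 0.0) + p
    return sum(p * math.log(Q[k] / nu[(k[2], k[3])])
               for k, p in joint.items() if p > 0.0)


def mix(A, B, lam):
    """Pointwise convex combination lam * A + (1 - lam) * B."""
    return {k: lam * A[k] + (1.0 - lam) * B[k] for k in A}


rng = random.Random(7)
P_a, P_b = causal_input(2, 2, rng), causal_input(2, 2, rng)
Q_a, Q_b = causal_channel(2, 2, rng), causal_channel(2, 2, rng)
lam = 0.3

# Convexity in Q for fixed P: the gap below should be non-negative.
convex_gap = (lam * directed_information(P_a, Q_a)
              + (1 - lam) * directed_information(P_a, Q_b)
              - directed_information(P_a, mix(Q_a, Q_b, lam)))
# Concavity in P for fixed Q: the gap below should be non-negative.
concave_gap = (directed_information(mix(P_a, P_b, lam), Q_a)
               - lam * directed_information(P_a, Q_a)
               - (1 - lam) * directed_information(P_b, Q_a))
print(convex_gap, concave_gap)
```

Mixing the products keeps the consistency conditions intact in this toy case: summing the mixed $\overrightarrow{Q}_{0,1}$ over $y_1$ leaves a kernel in $y_0$ that depends on $x_0$ only, and summing the mixed $\overleftarrow{P}_{0,1}$ over $x_1$ leaves a distribution on $x_0$ that does not depend on $y_0$.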

C. Weak Convergence and Compactness of Conditional Distributions 

In this section we give general sufficient conditions for weak compactness of the sets of measures $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ and $\mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$, and compactness of the set of joint and marginal measures with respect to the topology of weak convergence of probability measures. These conditions are sufficient to show lower semicontinuity of $\mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n})$ for fixed $\overleftarrow{P}_{0,n} \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ (respectively $\overrightarrow{Q}_{0,n} \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$) with respect to $\overrightarrow{Q}_{0,n} \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ (respectively $\overleftarrow{P}_{0,n} \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$). These results are analogous to the lower semicontinuity of mutual information, extensively utilized in information theory and statistics (see [2 ;], [27]).

Before we state the main theorem, we introduce the following notation. Let $BC(\mathcal{X})$ denote the set of bounded, continuous real-valued functions $f$ defined on the metric space $(\mathcal{X}, d)$, endowed with the supremum norm $\|f\| = \sup_{x \in \mathcal{X}} |f(x)|$. A sequence of probability measures $\{P_n : n \geq 1\} \subset \mathcal{M}_1(\mathcal{X})$ is said to converge weakly to $P \in \mathcal{M}_1(\mathcal{X})$ if [32]

$$\lim_{n \rightarrow \infty} \int_{\mathcal{X}} f(x) \, dP_n(x) = \int_{\mathcal{X}} f(x) \, dP(x), \quad \forall f \in BC(\mathcal{X}).$$

Weak convergence of $\{P_n : n \geq 1\}$ to $P$ is denoted by $P_n \xrightarrow{w} P$. Appendix VI-A summarizes the main results of weak convergence, compactness, tightness, and Prohorov's theorem which we will invoke to derive the results of this section.
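
As a small illustration of the definition of weak convergence (an example chosen here, not taken from the paper), consider the Dirac measures $P_n = \delta_{1/n}$ on the real line and $P = \delta_0$. Integrals of every bounded continuous $f$ converge, although $P_n(\{0\}) = 0$ for every $n$ while $P(\{0\}) = 1$. The test function and helper name below are hypothetical.

```python
import math


def integrate_point_mass(f, location):
    """Integral of f against the Dirac measure concentrated at `location`."""
    return f(location)


def f(x):
    """A bounded continuous test function on the real line."""
    return math.cos(x) / (1.0 + x * x)


indices = (1, 10, 100, 1000)
# |int f dP_n - int f dP| for P_n = delta_{1/n} and P = delta_0.
gaps = [abs(integrate_point_mass(f, 1.0 / n) - integrate_point_mass(f, 0.0))
        for n in indices]
# Mass that P_n assigns to the single point {0}: always zero, unlike P.
masses_at_zero = [1.0 if 1.0 / n == 0.0 else 0.0 for n in indices]
print(gaps, masses_at_zero)
```

The gaps shrink as $n$ grows, while the mass at the point $0$ does not converge to $1$; weak convergence tests measures only through bounded continuous functions, not through arbitrary sets.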

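Throughout, sequences of points in $\mathcal{X}^{\mathbb{N}}$ and $\mathcal{Y}^{\mathbb{N}}$ are denoted by $\mathbf{x}^{(\alpha)} = \{x_0^{(\alpha)}, x_1^{(\alpha)}, \ldots\} \in \mathcal{X}^{\mathbb{N}}$, $\mathbf{y}^{(\alpha)} = \{y_0^{(\alpha)}, y_1^{(\alpha)}, \ldots\} \in \mathcal{Y}^{\mathbb{N}}$, $\alpha = 0, 1, 2, \ldots$ Moreover, $\mathbf{x}^{(\alpha)} \in \mathcal{X}^{\mathbb{N}}$ is said to converge to $\mathbf{x}^{(0)} \in \mathcal{X}^{\mathbb{N}}$ as $\alpha \rightarrow \infty$ if $x_n^{(\alpha)} \rightarrow x_n^{(0)}$ for all $n \in \mathbb{N}$. Sequences of points in $\mathcal{X}_{0,n}$ and $\mathcal{Y}_{0,n}$ are denoted by $x^{n,(\alpha)} = \{x_0^{(\alpha)}, x_1^{(\alpha)}, \ldots, x_n^{(\alpha)}\}$ and $y^{n,(\alpha)} = \{y_0^{(\alpha)}, y_1^{(\alpha)}, \ldots, y_n^{(\alpha)}\}$, $\alpha = 0, 1, \ldots$, respectively.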

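Next, we state the general novel theorem which is also used to show lower semicontinuity of directed information. The theorem consists of two parts depending on whether $\mathcal{Y}_{0,n}$ is compact and $p_n(\cdot; x^{n-1}, y^{n-1})$ is weakly continuous, or $\mathcal{X}_{0,n}$ is compact and $q_n(\cdot; y^{n-1}, x^n)$ is weakly continuous. In applications of information theory either one of them or both may be required.

Theorem III.5.

Part A. Let $\mathcal{Y}_{0,n}$ be a compact Polish space and $\mathcal{X}_{0,n}$ a Polish space. Assume $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ satisfies the following condition.

CA. For all $g(\cdot) \in BC(\mathcal{X}_{0,n})$, the function

$$(x^{n-1}, y^{n-1}) \in \mathcal{X}_{0,n-1} \times \mathcal{Y}_{0,n-1} \longmapsto \int g(x) \, p_n(dx; x^{n-1}, y^{n-1}) \in \mathbb{R} \tag{III.27}$$

is continuous jointly in the variables $(x^{n-1}, y^{n-1}) \in \mathcal{X}_{0,n-1} \times \mathcal{Y}_{0,n-1}$.

Then the following hold.

A1) Let $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ and $\{\overrightarrow{Q}^{\alpha}_{0,n}(\cdot|x^n) : \alpha = 1, 2, \ldots\} \subset \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$. Then the joint measure $(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^{\alpha}_{0,n})(dx^n, dy^n) \xrightarrow{w} (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^{0}_{0,n})(dx^n, dy^n)$, where $\overrightarrow{Q}^{0}_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$.

A2) Let $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$, $\{\overrightarrow{Q}^{\alpha}_{0,n}(\cdot|x^n) : \alpha = 1, 2, \ldots\} \subset \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ and define the family of joint measures $\{(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^{\alpha}_{0,n})(dx^n, dy^n) : \alpha = 1, 2, \ldots\}$ having marginals $\{\nu^{\alpha}_{0,n} : \alpha = 1, 2, \ldots\}$ on $\mathcal{Y}_{0,n}$ and $\{\mu^{\alpha}_{0,n} : \alpha = 1, 2, \ldots\}$ on $\mathcal{X}_{0,n}$. Then $\nu^{\alpha}_{0,n}(dy^n) \xrightarrow{w} \nu^{0}_{0,n}(dy^n)$ and $\mu^{\alpha}_{0,n}(dx^n) \xrightarrow{w} \mu^{0}_{0,n}(dx^n)$, where $\nu^{0}_{0,n} \in \mathcal{M}_1(\mathcal{Y}_{0,n})$ and $\mu^{0}_{0,n} \in \mathcal{M}_1(\mathcal{X}_{0,n})$ are the marginals of $(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^{0}_{0,n})(dx^n, dy^n)$.

A3) The set of measures $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ is weakly compact.

A4) The set of measures $\mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ is weakly compact.

A5) Let $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$, $\{\overrightarrow{Q}^{\alpha}_{0,n}(\cdot|x^n) : \alpha = 1, 2, \ldots\} \subset \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$, and let $\{\nu^{\alpha}_{0,n} : \alpha = 1, 2, \ldots\}$ be the marginals of $\{(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^{\alpha}_{0,n})(dx^n, dy^n) : \alpha = 1, 2, \ldots\}$. Then

$$\overleftarrow{P}_{0,n}(dx^n|y^{n-1}) \otimes \nu^{\alpha}_{0,n}(dy^n) \xrightarrow{w} \overleftarrow{P}_{0,n}(dx^n|y^{n-1}) \otimes \nu^{0}_{0,n}(dy^n),$$

where $\nu^{0}_{0,n} \in \mathcal{M}_1(\mathcal{Y}_{0,n})$ is the limit of $\nu^{\alpha}_{0,n} \in \mathcal{M}_1(\mathcal{Y}_{0,n})$.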

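Part B. Let $\mathcal{X}_{0,n}$ be a compact Polish space and $\mathcal{Y}_{0,n}$ a Polish space. Assume $\overrightarrow{Q}_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ satisfies the following condition.

CB. For all $h(\cdot) \in BC(\mathcal{Y}_{0,n})$, the function

$$(x^{n}, y^{n-1}) \in \mathcal{X}_{0,n} \times \mathcal{Y}_{0,n-1} \longmapsto \int h(y) \, q_n(dy; y^{n-1}, x^n) \in \mathbb{R} \tag{III.28}$$

is continuous jointly in the variables $(x^n, y^{n-1}) \in \mathcal{X}_{0,n} \times \mathcal{Y}_{0,n-1}$.

Then the following hold.

B1) Let $\overrightarrow{Q}_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ and $\{\overleftarrow{P}^{\alpha}_{0,n}(\cdot|y^{n-1}) : \alpha = 1, 2, \ldots\} \subset \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$. Then the joint measure $(\overleftarrow{P}^{\alpha}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n) \xrightarrow{w} (\overleftarrow{P}^{0}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n)$, where $\overleftarrow{P}^{0}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$.

B2) Let $\overrightarrow{Q}_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ and $\{\overleftarrow{P}^{\alpha}_{0,n}(\cdot|y^{n-1}) : \alpha = 1, 2, \ldots\} \subset \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$, and define the family of joint measures $\{(\overleftarrow{P}^{\alpha}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n) : \alpha = 1, 2, \ldots\}$ having marginals $\{\nu^{\alpha}_{0,n} : \alpha = 1, 2, \ldots\}$ on $\mathcal{Y}_{0,n}$ and $\{\mu^{\alpha}_{0,n} : \alpha = 1, 2, \ldots\}$ on $\mathcal{X}_{0,n}$. Then $\nu^{\alpha}_{0,n}(dy^n) \xrightarrow{w} \nu^{0}_{0,n}(dy^n)$ and $\mu^{\alpha}_{0,n}(dx^n) \xrightarrow{w} \mu^{0}_{0,n}(dx^n)$, where $\nu^{0}_{0,n} \in \mathcal{M}_1(\mathcal{Y}_{0,n})$ and $\mu^{0}_{0,n} \in \mathcal{M}_1(\mathcal{X}_{0,n})$ are the marginals of $(\overleftarrow{P}^{0}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n)$.

B3) The set of measures $\mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ is weakly compact.

B4) The set of measures $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ is weakly compact.

B5) Let $\overrightarrow{Q}_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$, $\{\overleftarrow{P}^{\alpha}_{0,n}(\cdot|y^{n-1}) : \alpha = 1, 2, \ldots\} \subset \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$, and let $\{\mu^{\alpha}_{0,n} : \alpha = 1, 2, \ldots\}$ be the marginals of $\{(\overleftarrow{P}^{\alpha}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n) : \alpha = 1, 2, \ldots\}$. Then

$$\mu^{\alpha}_{0,n}(dx^n) \otimes \overrightarrow{Q}_{0,n}(dy^n|x^n) \xrightarrow{w} \mu^{0}_{0,n}(dx^n) \otimes \overrightarrow{Q}_{0,n}(dy^n|x^n),$$

where $\mu^{0}_{0,n} \in \mathcal{M}_1(\mathcal{X}_{0,n})$ is the limit of $\mu^{\alpha}_{0,n} \in \mathcal{M}_1(\mathcal{X}_{0,n})$.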

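Proof. Part A. For $\alpha = 1, 2, \ldots$, let $\overrightarrow{Q}^{\alpha}_{0,n}(\cdot|\cdot) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ be a sequence of forward channels and $\{(X^{\alpha}_n, Y^{\alpha}_n) : n \in \mathbb{N}\}$ a sequence of the basic joint process corresponding to the backward channel $\overleftarrow{P}_{0,n}(\cdot|\cdot) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ and forward channel $\overrightarrow{Q}^{\alpha}_{0,n}(\cdot|\cdot) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$. The important steps for the derivation of A1), A2) are given in [21] for stochastic control problems with randomized controls. Since we shall use A1), A2), and parts of their derivations to show A3)-A5), first we give the details of the derivation of A1) and A2) in Appendix VII-C.
A3) Next, we show that the family of probability measures $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ is weakly compact. By consistency condition C1, any $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ uniquely defines a family of regular conditional distributions $p_i(\cdot; x^{i-1}, y^{i-1}) \in \mathcal{Q}(\mathcal{X}_i; \mathcal{X}_{0,i-1} \times \mathcal{Y}_{0,i-1})$, $i = 0, 1, \ldots, n$, via (II.1). Hence, (II.1) can be used to relate weak compactness of the family $p_i(\cdot; x^{i-1}, y^{i-1}) \in \mathcal{Q}(\mathcal{X}_i; \mathcal{X}_{0,i-1} \times \mathcal{Y}_{0,i-1})$, $i = 0, 1, \ldots, n$, to weak compactness of the family of measures $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$.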

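By recalling the derivation in Appendix VII-C, A2), condition (VII.5), for $K_{0,n} = \times_{i=0}^{n} K_i$, $K_i \in \mathcal{B}(\mathcal{X}_i)$ compact sets, $0 \leq i \leq n$, then

$$\mathbf{P}(K_{0,n}|\mathbf{y}) = \int_{K_0} p_0(dx_0) \int_{K_1} p_1(dx_1; x^0, y^0) \cdots \int_{K_n} p_n(dx_n; x^{n-1}, y^{n-1})$$

$$\geq \Big(1 - \frac{\epsilon_1}{2^{n+1}}\Big) \int_{K_0} p_0(dx_0) \int_{K_1} p_1(dx_1; x^0, y^0) \cdots \int_{K_{n-1}} p_{n-1}(dx_{n-1}; x^{n-2}, y^{n-2})$$

$$\geq \Big(1 - \frac{\epsilon_1}{2^{n+1}} - \frac{\epsilon_1}{2^{n}} + \frac{\epsilon_1^2}{2^{2n+1}}\Big) \int_{K_0} p_0(dx_0) \int_{K_1} p_1(dx_1; x^0, y^0) \cdots \int_{K_{n-2}} p_{n-2}(dx_{n-2}; x^{n-3}, y^{n-3})$$

$$> \Big(1 - \frac{\epsilon_1}{2^{n+1}} - \frac{\epsilon_1}{2^{n}}\Big) \int_{K_0} p_0(dx_0) \int_{K_1} \cdots \int_{K_{n-2}} p_{n-2}(dx_{n-2}; x^{n-3}, y^{n-3}).$$

By repeating the above procedure the following bound is obtained.

$$\mathbf{P}(K_{0,n}|\mathbf{y}) > 1 - \frac{\epsilon_1}{2^{n+1}} - \frac{\epsilon_1}{2^{n}} - \frac{\epsilon_1}{2^{n-1}} - \cdots - \frac{\epsilon_1}{2} > 1 - \epsilon_1,$$

for any $n \in \mathbb{N}$ and for every $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$.

Since $\{K_i : i = 0, 1, \ldots, n\}$ are compact, from the last inequality it follows that the family of measures $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ is tight, and hence weakly compact. This completes the derivation of A3).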
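
The final step of the bound for A3) rests on a geometric sum. The short derivation below is a sketch under the assumption, read off from the displayed pattern and not stated explicitly at this point in the text, that each kernel satisfies $p_i(K_i; x^{i-1}, y^{i-1}) \geq 1 - \epsilon_1/2^{i+1}$ on the relevant compact sets.

```latex
% Assumption (hypothetical labelling): a_i := \epsilon_1 / 2^{i+1}, i = 0, ..., n,
% is the tightness defect of the i-th kernel.
\begin{align*}
\prod_{i=0}^{n} (1 - a_i) \;&\geq\; 1 - \sum_{i=0}^{n} a_i
   && \text{(valid for any } a_i \in [0,1]\text{, by induction on } n\text{)} \\
\sum_{i=0}^{n} \frac{\epsilon_1}{2^{i+1}}
   \;&=\; \epsilon_1 \Big( 1 - 2^{-(n+1)} \Big) \;<\; \epsilon_1
   && \text{(finite geometric series)} \\
\Longrightarrow \quad
\prod_{i=0}^{n} \Big( 1 - \frac{\epsilon_1}{2^{i+1}} \Big) \;&>\; 1 - \epsilon_1
   && \text{uniformly in } n .
\end{align*}
```

The uniformity in $n$ is what makes the bound usable for tightness: the same $\epsilon_1$ controls the mass outside $K_{0,n}$ for every horizon.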

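A4) Weak compactness of the family of measures $\overrightarrow{Q}_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ follows from the fact that $\mathcal{Y}_{0,n}$ is a compact Polish space. Indeed, any $\overrightarrow{Q}_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ uniquely defines a family of regular conditional distributions $q_i(\cdot; y^{i-1}, x^i) \in \mathcal{Q}(\mathcal{Y}_i; \mathcal{Y}_{0,i-1} \times \mathcal{X}_{0,i})$, $i = 0, 1, \ldots, n$. It was already shown that given $\epsilon_2 > 0$ one can construct compact sets $\Phi_0 \subset \mathcal{Y}_0$, $\Phi_1 \subset \mathcal{Y}_1$, $\ldots$, $\Phi_{n-1} \subset \mathcal{Y}_{n-1}$, such that (see Appendix VII-C, A1)) $q_n(\Phi_n; y^{n-1}, x^n) \geq 1 - \frac{\epsilon_2}{2^{n+1}}$ for any $y_0 \in \Phi_0, y_1 \in \Phi_1, \ldots, y_{n-1} \in \Phi_{n-1}$, $x^n \in \mathcal{X}_{0,n}$. Utilizing the last expression and the unique representation of $\overrightarrow{Q}_{0,n}(\cdot|x^n)$ given by (II.4), tightness of $\mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ can be shown, and hence it is weakly compact. This completes the derivation of A4).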

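A5) Utilizing the weak convergence $\nu^{\alpha}_{0,n} \xrightarrow{w} \nu^{0}_{0,n}$ (shown under Appendix VII-C, A2)), we shall show weak convergence of the convolution of measures $\overrightarrow{\Pi}^{\alpha}_{0,n}(dx^n, dy^n) = \overleftarrow{P}_{0,n}(dx^n|y^{n-1}) \otimes \nu^{\alpha}_{0,n}(dy^n) \xrightarrow{w} \overleftarrow{P}_{0,n}(dx^n|y^{n-1}) \otimes \nu^{0}_{0,n}(dy^n) = \overrightarrow{\Pi}^{0}_{0,n}(dx^n, dy^n)$, when $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ is fixed. We show weak convergence by considering integrals with respect to $g_{0,n}(x^n) h_{0,n}(y^n)$, where $g_{0,n}(\cdot) \in BC(\mathcal{X}_{0,n})$ and $h_{0,n}(\cdot) \in BC(\mathcal{Y}_{0,n})$. Let $\epsilon > 0$ be given. Condition CA implies that the function mapping

$$y^{n-1} \in \mathcal{Y}_{0,n-1} \longmapsto \int_{\mathcal{X}_{0,n}} g(x^n) \, \overleftarrow{P}_{0,n}(dx^n|y^{n-1}) \in \mathbb{R} \tag{III.29}$$

is continuous. Hence, by the weak convergence $\nu^{\alpha}_{0,n} \xrightarrow{w} \nu^{0}_{0,n}$ and the continuity of the function mapping (III.29), there exists $N \in \mathbb{N}$ such that for all $\alpha \geq N$

$$\left| \int_{\mathcal{Y}_{0,n}} \Big( \int_{\mathcal{X}_{0,n}} g(x^n) \, \overleftarrow{P}_{0,n}(dx^n|y^{n-1}) \Big) h(y^n) \, \nu^{\alpha}_{0,n}(dy^n) - \int_{\mathcal{Y}_{0,n}} \Big( \int_{\mathcal{X}_{0,n}} g(x^n) \, \overleftarrow{P}_{0,n}(dx^n|y^{n-1}) \Big) h(y^n) \, \nu^{0}_{0,n}(dy^n) \right| < \epsilon.$$

Since $\epsilon > 0$ is arbitrary, the derivation of A5) is complete.

Part B. The methodology is similar to that of Part A., hence it is omitted. □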

It is important to illustrate how some parts of Theorem III.5 compare to analogous results for mutual information discussed in [20], [27]. To this end, consider Part B., B1). If we use mutual information [_v , Lemma 2], then the sequence of joint measures is defined by $P^{\alpha}_{X^n, Y^n}(dx^n, dy^n) = P_{Y^n|X^n}(dy^n|x^n) \otimes P^{\alpha}_{X^n}(dx^n)$, and this involves a single convolution, since $P^{\alpha}_{X^n}(dx^n)$ is not conditioned on $y^{n-1} \in \mathcal{Y}_{0,n-1}$. Clearly, if the mapping $x^n \mapsto P_{Y^n|X^n}(\cdot|x^n)$ is weakly continuous, and $P^{\alpha}_{X^n}(dx^n)$ converges weakly to $P^{0}_{X^n}(dx^n)$, then $P^{\alpha}_{X^n, Y^n}(dx^n, dy^n)$ converges weakly to $P_{Y^n|X^n}(dy^n|x^n) \otimes P^{0}_{X^n}(dx^n) = P^{0}_{X^n, Y^n}(dx^n, dy^n)$, and so does its marginal on $\mathcal{Y}_{0,n}$.
On the other hand, if we use directed information, then the joint measure $P^{\alpha}_{X^n, Y^n}(dx^n, dy^n) = \otimes_{i=0}^{n} P_{Y_i|Y^{i-1}, X^i}(dy_i|y^{i-1}, x^i) \otimes P^{\alpha}_{X_i|X^{i-1}, Y^{i-1}}(dx_i|x^{i-1}, y^{i-1})$ involves an $(n+1)$-fold convolution, and $P^{\alpha}_{X_i|X^{i-1}, Y^{i-1}}(\cdot|\cdot, \cdot)$ is a function of $y^{n-1} \in \mathcal{Y}_{0,n-1}$, hence a significant level of additional complexity is incurred compared to mutual information. Nevertheless, condition CB is the natural generalization of the weak continuity of the mapping $x^n \mapsto P_{Y^n|X^n}(\cdot|x^n)$, assumed for the mutual information by Csiszar in [20], to causally conditioned $(n+1)$-fold convolutional measures.



Theorem III.5 is important for several extremum problems involving directed information. In the next Remark we discuss possible applications of Theorem III.5.

Remark III.6.

1) Consider extremum problems of capacity of channels with memory and feedback. Then, in the absence of input constraints, the aim is to find the input conditional distribution $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ which achieves the supremum of directed information. To show that such a conditional distribution exists, it is sufficient to show compactness of $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$; Theorem III.5, Part A., A3) establishes weak compactness. Moreover, in the presence of power constraints $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{P}_{0,n}(P) \subset \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$, to show that such conditional distributions exist, by Prohorov's theorem (Appendix VI-A, Theorem VI.5), it is sufficient to show that $\mathcal{P}_{0,n}(P)$ is a closed and uniformly tight subset of the weakly compact set $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$.

2) Consider extremum problems of sequential or nonanticipative lossy data compression with distortion constraint $\mathcal{Q}_{0,n}(D) \subset \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ [-^1, 1^3]. Then Theorem III.5, Part A., A4) is important to show weak compactness of $\mathcal{Q}_{0,n}(D)$ as a subset of the weakly compact set $\mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$, utilizing Prohorov's theorem (Appendix VI-A, Theorem VI.4). For such problems, condition CA is replaced by the requirement that, for all $g(\cdot) \in BC(\mathcal{X}_{0,n})$, the function

$$x^{n-1} \in \mathcal{X}_{0,n-1} \longmapsto \int g(x) \, p_n(dx; x^{n-1}) \in \mathbb{R}$$

is continuous in $x^{n-1} \in \mathcal{X}_{0,n-1}$, since the source $\{p_n(dx_n; x^{n-1}) : n \in \mathbb{N}\}$ is not affected by past reconstructions $\{y_n : n \in \mathbb{N}\}$ (see [4]).

3) Consider extremum problems of capacity for a class of channels with memory and feedback, such as arbitrarily varying channels [20]. Such problems are defined by the max-min operations of directed information, where the minimization is over the class of channels [ ^ ]. To investigate such capacity problems one has to establish coding theorems, and showing compactness over the class of channel conditional distributions, in addition to channel input distributions, is critical. Theorem III.5, Part B., B3) establishes weak compactness of this family of channels.

4) Consider extremum problems of sequential or nonanticipative lossy data compression for a class of sources. Then such problems are defined by min-max operations of directed information, where the maximization is over the class of source distributions [35]. To investigate such data compression problems one has to establish coding theorems, and to show compactness over the class of source distributions, in addition to the reproduction distributions. Theorem III.5, Part A., A3) is crucial.

In the previous applications, Theorem III.5 gives the choice of considering either $\mathcal{Y}_{0,n}$ or $\mathcal{X}_{0,n}$ to be compact.
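D. Lower Semicontinuity of Directed Information

We are now ready to utilize the results of Theorem III.5 to show lower semicontinuity of directed information $I(X^n \rightarrow Y^n) \equiv \mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n})$. This may be viewed as a generalization of lower semicontinuity of mutual information $I(X^n; Y^n) \equiv \mathbb{I}_{X^n; Y^n}(P_{X^n}, Q_{Y^n|X^n})$, with respect to $P_{X^n}$ for fixed $Q_{Y^n|X^n}$, and with respect to $Q_{Y^n|X^n}$ for fixed $P_{X^n}$.

Theorem III.7.

1) Suppose the conditions in Theorem III.5, Part A., hold.
Then $\mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n})$ is lower semicontinuous on $\overrightarrow{Q}_{0,n} \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ for fixed $\overleftarrow{P}_{0,n} \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$.

2) Suppose the conditions in Theorem III.5, Part B., hold.
Then $\mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n})$ is lower semicontinuous on $\overleftarrow{P}_{0,n} \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ for fixed $\overrightarrow{Q}_{0,n} \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$.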
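Proof. 1) We need to show that for any sequence $\{\overrightarrow{Q}^{\alpha}_{0,n}(\cdot|x^n) : \alpha = 1, 2, \ldots\}$, $\overrightarrow{Q}^{0}_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ such that $\overrightarrow{Q}^{\alpha}_{0,n}(\cdot|x^n) \xrightarrow{w} \overrightarrow{Q}^{0}_{0,n}(\cdot|x^n)$, then

$$\mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}^{0}_{0,n}) \leq \liminf_{\alpha \rightarrow \infty} \mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}^{\alpha}_{0,n}).$$

Define the sequence $P^{\alpha}_{0,n}(dx^n, dy^n) \triangleq (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^{\alpha}_{0,n})(dx^n, dy^n)$. Weak convergence $P^{\alpha}_{0,n}(dx^n, dy^n) \xrightarrow{w} (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^{0}_{0,n})(dx^n, dy^n) \equiv P^{0}_{0,n}(dx^n, dy^n)$ is shown by considering integrals with respect to a test function $\phi_{0,n}(\cdot, \cdot) \in BC(\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n})$. By Theorem III.5, Part A., A1), $P^{\alpha}_{0,n}(dx^n, dy^n) \xrightarrow{w} P^{0}_{0,n}(dx^n, dy^n)$. Similarly, $\overrightarrow{\Pi}^{\alpha}_{0,n} = \overleftarrow{P}_{0,n} \otimes \nu^{\alpha}_{0,n}$, where $\{\nu^{\alpha}_{0,n} : \alpha = 1, 2, \ldots\}$ are the marginals of $\{P^{\alpha}_{0,n} : \alpha = 1, 2, \ldots\}$, and by Theorem III.5, Part A., A5) we have

$$\overrightarrow{\Pi}^{\alpha}_{0,n} = \overleftarrow{P}_{0,n} \otimes \nu^{\alpha}_{0,n} \xrightarrow{w} \overrightarrow{\Pi}^{0}_{0,n} = \overleftarrow{P}_{0,n} \otimes \nu^{0}_{0,n}.$$

Recall the definition of directed information via relative entropy given by

$$D(P_{0,n} \,\|\, \overrightarrow{\Pi}_{0,n}) = D(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n} \,\|\, \overleftarrow{P}_{0,n} \otimes \nu_{0,n}) = \mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n}). \tag{III.30}$$

Relative entropy is lower semicontinuous, hence

$$D(P^{0}_{0,n} \,\|\, \overrightarrow{\Pi}^{0}_{0,n}) = D(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^{0}_{0,n} \,\|\, \overrightarrow{\Pi}^{0}_{0,n}) \leq \liminf_{\alpha \rightarrow \infty} D(P^{\alpha}_{0,n} \,\|\, \overrightarrow{\Pi}^{\alpha}_{0,n}). \tag{III.31}$$

By (III.30) it follows that (III.31) is also equivalent to

$$\mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}^{0}_{0,n}) \leq \liminf_{\alpha \rightarrow \infty} \mathbb{I}_{X^n \rightarrow Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}^{\alpha}_{0,n}).$$

Hence, directed information is lower semicontinuous as a functional of $\overrightarrow{Q}_{0,n} \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ for a fixed $\overleftarrow{P}_{0,n} \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$. This completes the derivation of 1).

2) The derivation is similar to 1). □

By comparison with lower semicontinuity of mutual information $I(X^n; Y^n) \equiv \mathbb{I}_{X^n; Y^n}(P_{X^n}, Q_{Y^n|X^n})$, it is clear that directed information imposes additional assumptions for its derivation (e.g., those given in Theorem III.5).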

E. Continuity of Directed Information 

Note that Theorem III. 5 together with Theorem III.7 are important to establish existence of the 
optimal reconstruction distribution for sequential and nonanticipative rate distortion functions 
(by utilizing Weierstrass' Theorem). However, many problems in information theory involve 
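extremum problems defined as maximizations of directed information with respect to the feedback channels $\{p_i(dx_i;x^{i-1},y^{i-1}):i=0,1,\ldots,n\}$, and for such problems it is desirable to have upper semicontinuity of directed information with respect to $\overleftarrow{P}_{0,n}$. Since by Theorem III.7, directed information is lower semicontinuous with respect to $\overleftarrow{P}_{0,n}\in\mathcal{Q}^{C1}(\mathcal{X}_{0,n};\mathcal{Y}_{0,n-1})$, to investigate extremum problems involving feedback capacity, it is sufficient to show continuity of the functional $\mathbb{I}_{X^n\rightarrow Y^n}(\overleftarrow{P}_{0,n},\overrightarrow{Q}_{0,n})$ with respect to $\overleftarrow{P}_{0,n}\in\mathcal{Q}^{C1}(\mathcal{X}_{0,n};\mathcal{Y}_{0,n-1})$ for a fixed $\overrightarrow{Q}_{0,n}\in\mathcal{Q}^{C2}(\mathcal{Y}_{0,n};\mathcal{X}_{0,n})$.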
Continuity of mutual information based on single letter expression is shown in [20, Lemma 7], and in [36, Theorem 3.2] under weaker conditions. Here, we show continuity of directed information by following the procedure in [36], generalized to the directed information functional $\mathbb{I}_{X^n\rightarrow Y^n}(\overleftarrow{P}_{0,n},\overrightarrow{Q}_{0,n})$. First, we shall need the following Lemma.

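Lemma III.8.

For a given $\overleftarrow{P}_{0,n}\in\mathcal{Q}^{C1}(\mathcal{X}_{0,n};\mathcal{Y}_{0,n-1})$ and $\overrightarrow{Q}_{0,n}\in\mathcal{Q}^{C2}(\mathcal{Y}_{0,n};\mathcal{X}_{0,n})$ define

$$|\mathbb{I}_{X^n\rightarrow Y^n}|(\overleftarrow{P}_{0,n},\overrightarrow{Q}_{0,n})\triangleq\int\left|\log\left(\frac{d(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})}{d(\overleftarrow{P}_{0,n}\otimes\nu_{0,n})}\right)\right|d(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n}).$$

Then the following inequalities hold.

$$\mathbb{I}_{X^n\rightarrow Y^n}(\overleftarrow{P}_{0,n},\overrightarrow{Q}_{0,n})\le|\mathbb{I}_{X^n\rightarrow Y^n}|(\overleftarrow{P}_{0,n},\overrightarrow{Q}_{0,n})\le\mathbb{I}_{X^n\rightarrow Y^n}(\overleftarrow{P}_{0,n},\overrightarrow{Q}_{0,n})+\frac{2}{e\ln 2}.\qquad\text{(III.32)}$$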

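Proof. The directed information as it is defined in Theorem III.1 and (III.14) can be expressed as follows.

$$\mathbb{I}_{X^n\rightarrow Y^n}(\overleftarrow{P}_{0,n},\overrightarrow{Q}_{0,n})=\mathbb{D}(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n}\|\overleftarrow{P}_{0,n}\otimes\nu_{0,n})=\int\log\left(\frac{d(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})}{d(\overleftarrow{P}_{0,n}\otimes\nu_{0,n})}\right)d(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})$$

$$=\int\left(\frac{d(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})}{d(\overleftarrow{P}_{0,n}\otimes\nu_{0,n})}\right)\log\left(\frac{d(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})}{d(\overleftarrow{P}_{0,n}\otimes\nu_{0,n})}\right)d(\overleftarrow{P}_{0,n}\otimes\nu_{0,n}).\qquad\text{(III.33)}$$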

The first inequality in (III.32) is obvious. For the second inequality recall the inequality [29, Section 2.3, p. 13] $-\frac{1}{e\ln 2}\le x\log_2 x$, $x\in[0,\infty)$, where $0\log 0$ is assumed to be 0. Then,

$$|x\log_2 x|\le x\log_2 x+\frac{2}{e\ln 2}.\qquad\text{(III.34)}$$

Using the last inequality in (III.33) with $x=\frac{d(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})}{d(\overleftarrow{P}_{0,n}\otimes\nu_{0,n})}$ establishes the second inequality in (III.32). □
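The elementary bound behind (III.34) can be checked numerically. The following Python sketch is an illustration under the assumption that the logarithm is taken to base 2 (as the constant $e\ln 2$ suggests); the grid and the tolerance are our own choices. It verifies on a grid that $x\log_2 x$ is bounded below by $-1/(e\ln 2)$, with the minimum attained near $x=1/e$, and that $|x\log_2 x|\le x\log_2 x+2/(e\ln 2)$.

```python
import math

# Constant 1 / (e ln 2); it equals minus the minimum of x * log2(x), attained at x = 1/e.
C = 1.0 / (math.e * math.log(2.0))

def x_log2_x(x):
    """The function x * log2(x) with the convention 0 log 0 = 0."""
    return 0.0 if x == 0.0 else x * math.log2(x)

# Illustrative grid on [0, 5] with step 0.001.
grid = [k / 1000.0 for k in range(0, 5001)]

lower_bound_ok = all(x_log2_x(x) >= -C - 1e-12 for x in grid)
abs_bound_ok = all(abs(x_log2_x(x)) <= x_log2_x(x) + 2.0 * C + 1e-12 for x in grid)
minimum_on_grid = min(x_log2_x(x) for x in grid)

print(lower_bound_ok, abs_bound_ok, minimum_on_grid, -C)
```

Integrating the second bound against the reference measure is what produces the additive constant in (III.32).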

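Now, we are ready to state the Theorem which establishes continuity with respect to weak convergence of $\mathbb{I}_{X^n\rightarrow Y^n}(\overleftarrow{P}_{0,n},\overrightarrow{Q}_{0,n})$ for fixed $\overrightarrow{Q}_{0,n}$, as a functional of $\overleftarrow{P}_{0,n}\in\mathcal{Q}^{C1}(\mathcal{X}_{0,n};\mathcal{Y}_{0,n-1})$.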

Theorem III.9. 

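Consider a forward channel $\overrightarrow{Q}_{0,n}(\cdot|x^n)\in\mathcal{Q}^{C2}(\mathcal{Y}_{0,n};\mathcal{X}_{0,n})$, and a closed family of feedback channels $\mathcal{Q}^{C1}_{cl}(\mathcal{X}_{0,n};\mathcal{Y}_{0,n-1})\subseteq\mathcal{Q}^{C1}(\mathcal{X}_{0,n};\mathcal{Y}_{0,n-1})$. Suppose there exists a family of measures $\bar{\nu}_{0,n}(dy^n)$ on $(\mathcal{Y}_{0,n},\mathcal{B}(\mathcal{Y}_{0,n}))$ such that $\overrightarrow{Q}_{0,n}(\cdot|x^n)\ll\bar{\nu}_{0,n}(dy^n)$ with RNDs $\xi_{\bar{\nu}_{0,n}}(x^n,y^n)=\frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{\bar{\nu}_{0,n}(dy^n)}$.
Furthermore, suppose the following conditions hold.

A) The family of RNDs $\xi_{\bar{\nu}_{0,n}}(x^n,y^n)$ is continuous on $\mathcal{X}_{0,n}\times\mathcal{Y}_{0,n}$, and $\xi_{\bar{\nu}_{0,n}}(x^n,y^n)\log\xi_{\bar{\nu}_{0,n}}(x^n,y^n)$ is uniformly integrable over $\{\bar{\nu}_{0,n}\otimes\overleftarrow{P}_{0,n}:\overleftarrow{P}_{0,n}\in\mathcal{Q}^{C1}_{cl}(\mathcal{X}_{0,n};\mathcal{Y}_{0,n-1})\}$.

B) For a fixed $y^n\in\mathcal{Y}_{0,n}$, the RND $\xi_{\bar{\nu}_{0,n}}(x^n,y^n)$ is uniformly integrable over $\mathcal{Q}^{C1}_{cl}(\mathcal{X}_{0,n};\mathcal{Y}_{0,n-1})$.

Then, $\mathbb{I}_{X^n\rightarrow Y^n}(\overleftarrow{P}_{0,n},\overrightarrow{Q}_{0,n})$, as a functional of $\overleftarrow{P}_{0,n}\in\mathcal{Q}^{C1}_{cl}(\mathcal{X}_{0,n};\mathcal{Y}_{0,n-1})$ for fixed $\overrightarrow{Q}_{0,n}\in\mathcal{Q}^{C2}(\mathcal{Y}_{0,n};\mathcal{X}_{0,n})$, is bounded and weakly continuous over $\mathcal{Q}^{C1}_{cl}(\mathcal{X}_{0,n};\mathcal{Y}_{0,n-1})$.

Proof. The derivation is shown in Appendix VII-D. □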

Note that Theorem III.5 gives conditions for $\mathcal{Q}^{C1}(\mathcal{X}_{0,n};\mathcal{Y}_{0,n-1})$ to be weakly compact, hence under these conditions, and the additional conditions A), B) of Theorem III.9, $\mathbb{I}_{X^n\rightarrow Y^n}(\overleftarrow{P}_{0,n},\overrightarrow{Q}_{0,n})$ is bounded and weakly continuous over $\mathcal{Q}^{C1}(\mathcal{X}_{0,n};\mathcal{Y}_{0,n-1})$. Thus, Theorem III.5 together with Theorem III.9 establish all prerequisites to address extremum problems of capacity of channels with memory and feedback described under Remark III.6. Two such problems are briefly discussed in [37].



F. Extension of Directed Information to Arbitrary sequences of RV's 

In this section, we demonstrate how the previous results are easily generalized to three, or 
more, sequences of RV's. These extensions have implications in networks communication, and 
in communications with side information at either the transmitter or the receiver [25], [26]. 
To facilitate the demonstration, we consider the following case first. 

Case 1: The sequence of RV's $X^n\in\mathcal{X}_{0,n}$ is defined by $X^n=(X^{1,n},X^{2,n})\in\mathcal{X}^1_{0,n}\times\mathcal{X}^2_{0,n}=\mathcal{X}_{0,n}$, where $X^{1,n}=\{X^1_i:i=0,1,\ldots,n\}$ and $X^{2,n}=\{X^2_i:i=0,1,\ldots,n\}$.

Then the two sequences of conditional distributions are $\{p^1_i(dx^1_i,dx^2_i;x^{1,i-1},x^{2,i-1},y^{i-1}):i=0,1,\ldots,n\}$ and $\{q^1_i(dy_i;y^{i-1},x^{1,i},x^{2,i}):i=0,1,\ldots,n\}$, respectively. Consequently, all constructions of consistent families of conditional distributions and the results obtained so far extend naturally to directed information $\mathbb{I}_{(X^{1,n},X^{2,n})\rightarrow Y^n}(\overleftarrow{P}^1_{0,n},\overrightarrow{Q}^1_{0,n})$, where $\overleftarrow{P}^1_{0,n}(dx^{1,n},dx^{2,n}|y^{n-1})=\otimes_{i=0}^{n}p^1_i(dx^1_i,dx^2_i;x^{1,i-1},x^{2,i-1},y^{i-1})$, and $\overrightarrow{Q}^1_{0,n}(dy^n|x^{1,n},x^{2,n})=\otimes_{i=0}^{n}q^1_i(dy_i;y^{i-1},x^{1,i},x^{2,i})$.
Next, we consider the following case. 

Case 2: The sequence of RV's $Y^n\in\mathcal{Y}_{0,n}$ is defined by $Y^n=(Y^{1,n},Y^{2,n})\in\mathcal{Y}^1_{0,n}\times\mathcal{Y}^2_{0,n}=\mathcal{Y}_{0,n}$, where $Y^{1,n}=\{Y^1_i:i=0,1,\ldots,n\}$ and $Y^{2,n}=\{Y^2_i:i=0,1,\ldots,n\}$.

Then the two sequences of conditional distributions are $\{p^2_i(dx_i;x^{i-1},y^{1,i-1},y^{2,i-1}):i=0,1,\ldots,n\}$ and $\{q^2_i(dy^1_i,dy^2_i;y^{1,i-1},y^{2,i-1},x^i):i=0,1,\ldots,n\}$, respectively. Consequently, all constructions of consistent families of conditional distributions and the results obtained so far extend naturally to directed information $\mathbb{I}_{X^n\rightarrow(Y^{1,n},Y^{2,n})}(\overleftarrow{P}^2_{0,n},\overrightarrow{Q}^2_{0,n})$, where $\overleftarrow{P}^2_{0,n}(dx^n|y^{1,n-1},y^{2,n-1})=\otimes_{i=0}^{n}p^2_i(dx_i;x^{i-1},y^{1,i-1},y^{2,i-1})$, and $\overrightarrow{Q}^2_{0,n}(dy^{1,n},dy^{2,n}|x^n)=\otimes_{i=0}^{n}q^2_i(dy^1_i,dy^2_i;y^{1,i-1},y^{2,i-1},x^i)$.

Clearly, Case 1 and Case 2 can be generalized to an arbitrary number of sequences of RV's.

IV. Variational Equalities of Directed Information

In this section we generalize the well-known variational equalities of mutual information [19] 
to directed information. For single letter mutual information these are discussed in [ ' ]. Before 
we pursue this task, we recall the variational equalities of mutual information. It is well-known that mutual information $I(X^n;Y^n)=\mathbb{I}_{X^n;Y^n}(P_{X^n},P_{Y^n|X^n})$ can be expressed as the minimization or maximization of relative entropy functionals as follows [19].

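Min: Given a reconstruction distribution $P_{Y^n|X^n}(dy^n|x^n)$, a source $P_{X^n}$, and any distribution $V_{Y^n}(dy^n)$, then

$$\mathbb{I}_{X^n;Y^n}(P_{X^n},P_{Y^n|X^n})=\inf_{V_{Y^n}\in\mathcal{M}_1(\mathcal{Y}_{0,n})}\int_{\mathcal{X}_{0,n}\times\mathcal{Y}_{0,n}}\log\left(\frac{P_{Y^n|X^n}(dy^n|x^n)}{V_{Y^n}(dy^n)}\right)P_{Y^n|X^n}(dy^n|x^n)\otimes P_{X^n}(dx^n)\qquad\text{(IV.1)}$$

where the infimum is achieved at $V_{Y^n}(dy^n)=\int_{\mathcal{X}_{0,n}}P_{Y^n|X^n}(dy^n|x^n)\otimes P_{X^n}(dx^n)$.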



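Max: Given a channel $P_{Y^n|X^n}(dy^n|x^n)$, a distribution $P_{X^n}(dx^n)$, and any conditional distribution $P_{X^n|Y^n}(dx^n|y^n)$, then

$$\mathbb{I}_{X^n;Y^n}(P_{X^n},P_{Y^n|X^n})=\sup_{P_{X^n|Y^n}\in\mathcal{Q}(\mathcal{X}_{0,n};\mathcal{Y}_{0,n})}\int_{\mathcal{X}_{0,n}\times\mathcal{Y}_{0,n}}\log\left(\frac{P_{X^n|Y^n}(dx^n|y^n)}{P_{X^n}(dx^n)}\right)P_{Y^n|X^n}(dy^n|x^n)\otimes P_{X^n}(dx^n)\qquad\text{(IV.2)}$$

and the supremum is achieved at $P_{X^n|Y^n}(dx^n|y^n)=\frac{P_{Y^n|X^n}(dy^n|x^n)\otimes P_{X^n}(dx^n)}{\int_{\mathcal{X}_{0,n}}P_{Y^n|X^n}(dy^n|x^n)\otimes P_{X^n}(dx^n)}$.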

Both variational equalities are used in the Blahut-Arimoto algorithm [19], [39] to derive an itera- 
tive computational scheme for capacity and rate distortion via max-min and min-max operations, 
respectively. We shall derive analogous variational equalities for directed information. Before we 
state the main theorem, we introduce some preliminaries. 
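For finite alphabets the two variational equalities (IV.1) and (IV.2) can be checked directly. The following Python sketch is an illustration under our own assumptions: the source `px`, the channel `pyx`, the helper `random_dist`, and the number of random candidates are hypothetical choices, not part of the paper. It evaluates the objective of (IV.1) at randomly drawn output distributions and the objective of (IV.2) at randomly drawn backward kernels, and compares them with the mutual information obtained at the output marginal and at the posterior, respectively.

```python
import math
import random

def random_dist(rng, k):
    """A strictly positive probability vector of length k (illustrative helper)."""
    w = [rng.random() + 0.01 for _ in range(k)]
    total = sum(w)
    return [wi / total for wi in w]

px = [0.3, 0.7]                     # hypothetical source distribution P_X
pyx = [[0.9, 0.1], [0.2, 0.8]]      # hypothetical channel P_{Y|X}, rows indexed by x
xs, ys = range(2), range(2)

# Output marginal and posterior (the claimed optimizers of (IV.1) and (IV.2)).
py = [sum(px[x] * pyx[x][y] for x in xs) for y in ys]
post = [[px[x] * pyx[x][y] / py[y] for x in xs] for y in ys]  # rows indexed by y

def min_objective(v):
    """Objective of (IV.1) for a candidate output distribution v."""
    return sum(px[x] * pyx[x][y] * math.log(pyx[x][y] / v[y]) for x in xs for y in ys)

def max_objective(r):
    """Objective of (IV.2) for a candidate backward kernel r[y][x]."""
    return sum(px[x] * pyx[x][y] * math.log(r[y][x] / px[x]) for x in xs for y in ys)

mutual_info = min_objective(py)

rng = random.Random(0)
# Gaps should be nonnegative: candidates never beat the optimizers.
min_gaps = [min_objective(random_dist(rng, 2)) - mutual_info for _ in range(200)]
max_gaps = [mutual_info - max_objective([random_dist(rng, 2) for _ in ys]) for _ in range(200)]

print(mutual_info, min(min_gaps), min(max_gaps))
```

Alternating between the two optimizers is the mechanism exploited by the Blahut-Arimoto iterations; the sketch only checks the two equalities and does not implement that algorithm.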

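Let $\mathbf{P}(\cdot|\cdot)\in\mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}};\mathcal{Y}^{\mathbb{N}})$ and $\mathbf{Q}(\cdot|\cdot)\in\mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}};\mathcal{X}^{\mathbb{N}})$, and let $P_{0,n}(dx^n,dy^n)=\overleftarrow{P}_{0,n}(dx^n|y^{n-1})\otimes\overrightarrow{Q}_{0,n}(dy^n|x^n)$ be the given distribution constructed from the basic feedback channel $\mathbf{P}(\cdot|\mathbf{y})\in\mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}};\mathcal{Y}^{\mathbb{N}})$ and forward channel $\mathbf{Q}(\cdot|\mathbf{x})\in\mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}};\mathcal{X}^{\mathbb{N}})$ (by projection onto a finite number of coordinates).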

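Let $\mathbf{S}(\cdot|\mathbf{x})$ be any measure on $(\mathcal{Y}^{\mathbb{N}},\mathcal{B}(\mathcal{Y}^{\mathbb{N}}))$ depending parametrically on $\mathbf{x}\in\mathcal{X}^{\mathbb{N}}$ satisfying the consistency condition.

C3: If $F\in\mathcal{B}(\mathcal{Y}_{0,n})$, then $\mathbf{S}(F|\mathbf{x})$ is a $\mathcal{B}(\mathcal{X}_{0,n-1})$-measurable function of $\mathbf{x}\in\mathcal{X}^{\mathbb{N}}$.

Denote this family of measures by $\mathbf{S}(\cdot|\mathbf{x})\in\mathcal{Q}^{C3}(\mathcal{Y}^{\mathbb{N}};\mathcal{X}^{\mathbb{N}})$. By Remark II.1, for any family of measures $\mathbf{S}(\cdot|\mathbf{x})$ on $(\mathcal{Y}^{\mathbb{N}},\mathcal{B}(\mathcal{Y}^{\mathbb{N}}))$ satisfying consistency condition C3, there exists a collection of stochastic kernels $\{s_n(\cdot;\cdot,\cdot)\in\mathcal{Q}(\mathcal{Y}_n;\mathcal{Y}_{0,n-1}\times\mathcal{X}_{0,n-1}):n\in\mathbb{N}\}$ connected to $\mathbf{S}(\cdot|\mathbf{x})$. Indeed, given a cylinder set $D\in\mathcal{B}(\mathcal{Y}_{0,n})$ of the form

$$D=\{\mathbf{y}\in\mathcal{Y}^{\mathbb{N}}:y_0\in D_0,y_1\in D_1,\ldots,y_n\in D_n\},\quad D_i\in\mathcal{B}(\mathcal{Y}_i),\ 0\le i\le n$$

then

$$\mathbf{S}(D|\mathbf{x})=\int_{D_0}s_0(dy_0)\int_{D_1}s_1(dy_1;y_0,x_0)\ldots\int_{D_n}s_n(dy_n;y^{n-1},x^{n-1})=\overleftarrow{S}_{0,n}(\times_{i=0}^{n}D_i|x^{n-1}).\qquad\text{(IV.3)}$$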

Unlike the measure $\overrightarrow{Q}_{0,n}(\cdot|x^n)$ which is conditioned on $x^n\in\mathcal{X}_{0,n}$, the measure $\overleftarrow{S}_{0,n}(\cdot|x^{n-1})$ is conditioned on $x^{n-1}\in\mathcal{X}_{0,n-1}$.

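Let $\mathbf{R}(\cdot|\mathbf{y})$ be any family of measures on $(\mathcal{X}^{\mathbb{N}},\mathcal{B}(\mathcal{X}^{\mathbb{N}}))$ depending parametrically on $\mathbf{y}\in\mathcal{Y}^{\mathbb{N}}$ satisfying the consistency condition.

C4: If $E\in\mathcal{B}(\mathcal{X}_{0,n})$, then $\mathbf{R}(E|\mathbf{y})$ is a $\mathcal{B}(\mathcal{Y}_{0,n})$-measurable function of $\mathbf{y}\in\mathcal{Y}^{\mathbb{N}}$.

Denote this family of measures by $\mathbf{R}(\cdot|\mathbf{y})\in\mathcal{Q}^{C4}(\mathcal{X}^{\mathbb{N}};\mathcal{Y}^{\mathbb{N}})$. Similarly as before, by Remark II.1, for any family of measures $\mathbf{R}(\cdot|\mathbf{y})$ on $(\mathcal{X}^{\mathbb{N}},\mathcal{B}(\mathcal{X}^{\mathbb{N}}))$ satisfying consistency condition C4, there exists a collection of stochastic kernels $\{r_n(\cdot;\cdot,\cdot)\in\mathcal{Q}(\mathcal{X}_n;\mathcal{X}_{0,n-1}\times\mathcal{Y}_{0,n}):n\in\mathbb{N}\}$ connected to $\mathbf{R}(\cdot|\mathbf{y})$. Thus, given a cylinder set $E\in\mathcal{B}(\mathcal{X}_{0,n})$ of the form

$$E=\{\mathbf{x}\in\mathcal{X}^{\mathbb{N}}:x_0\in E_0,x_1\in E_1,\ldots,x_n\in E_n\},\quad E_i\in\mathcal{B}(\mathcal{X}_i),\ 0\le i\le n$$

then

$$\mathbf{R}(E|\mathbf{y})=\int_{E_0}r_0(dx_0;y_0)\int_{E_1}r_1(dx_1;x_0,y^1)\ldots\int_{E_n}r_n(dx_n;x^{n-1},y^n)=\overrightarrow{R}_{0,n}(\times_{i=0}^{n}E_i|y^n).\qquad\text{(IV.4)}$$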

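Note the difference between the stochastic kernels $\{p_i(dx_i;x^{i-1},y^{i-1}):i\in\mathbb{N}\}$ which define $\overleftarrow{P}_{0,n}(dx^n|y^{n-1})$, and the stochastic kernels $\{r_i(dx_i;x^{i-1},y^i):i\in\mathbb{N}\}$ which define $\overrightarrow{R}_{0,n}(dx^n|y^n)$. Define their joint distribution $(\overleftarrow{S}_{0,n}\otimes\overrightarrow{R}_{0,n})(dx^n,dy^n)$ on $(\mathcal{X}^{\mathbb{N}}\times\mathcal{Y}^{\mathbb{N}},\odot_{n\in\mathbb{N}}\mathcal{B}(\mathcal{X}_n)\odot\mathcal{B}(\mathcal{Y}_n))$.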

The following theorem gives two variational equalities for directed information, which are 
analogous to (IV. 1), (IV.2). 

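Theorem IV.1.

Let $\mathbf{P}(\cdot|\mathbf{y})\in\mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}};\mathcal{Y}^{\mathbb{N}})$ and $\mathbf{Q}(\cdot|\mathbf{x})\in\mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}};\mathcal{X}^{\mathbb{N}})$, and construct from them the measures $P_{0,n}(dx^n,dy^n)=(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})(dx^n,dy^n)$, $\overrightarrow{\Pi}_{0,n}(dx^n,dy^n)=\overleftarrow{P}_{0,n}(dx^n|y^{n-1})\otimes\nu_{0,n}(dy^n)$, $\nu_{0,n}(dy^n)=P_{0,n}(\mathcal{X}_{0,n},dy^n)$. Then we have the following variational equalities.

Part A. For any arbitrary measure $\bar{\nu}_{0,n}\in\mathcal{M}_1(\mathcal{Y}_{0,n})$ we have

$$\mathbb{I}_{X^n\rightarrow Y^n}(\overleftarrow{P}_{0,n},\overrightarrow{Q}_{0,n})=\mathbb{D}(P_{0,n}\|\overrightarrow{\Pi}_{0,n})=\inf_{\bar{\nu}_{0,n}\in\mathcal{M}_1(\mathcal{Y}_{0,n})}\mathbb{D}(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n}\|\overleftarrow{P}_{0,n}\otimes\bar{\nu}_{0,n})\qquad\text{(IV.5)}$$

$$=\inf_{\bar{\nu}_{0,n}\in\mathcal{M}_1(\mathcal{Y}_{0,n})}\left\{\int_{\mathcal{X}_{0,n}\times\mathcal{Y}_{0,n}}\log\left(\frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{\bar{\nu}_{0,n}(dy^n)}\right)(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})(dx^n,dy^n)\right\}\qquad\text{(IV.6)}$$

and the infimum in (IV.5) is achieved by

$$\bar{\nu}^{*}_{0,n}(dy^n)=\nu_{0,n}(dy^n)=\int_{\mathcal{X}_{0,n}}(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})(dx^n,dy^n).$$

Part B. For any $\mathbf{S}(\cdot|\mathbf{x})\in\mathcal{Q}^{C3}(\mathcal{Y}^{\mathbb{N}};\mathcal{X}^{\mathbb{N}})$ and $\mathbf{R}(\cdot|\mathbf{y})\in\mathcal{Q}^{C4}(\mathcal{X}^{\mathbb{N}};\mathcal{Y}^{\mathbb{N}})$ then

$$\mathbb{I}_{X^n\rightarrow Y^n}(\overleftarrow{P}_{0,n},\overrightarrow{Q}_{0,n})=\mathbb{D}(P_{0,n}\|\overrightarrow{\Pi}_{0,n})=\sup_{\overleftarrow{S}_{0,n}\in\mathcal{Q}^{C3}(\mathcal{Y}^{\mathbb{N}};\mathcal{X}^{\mathbb{N}}),\ \overrightarrow{R}_{0,n}\in\mathcal{Q}^{C4}(\mathcal{X}^{\mathbb{N}};\mathcal{Y}^{\mathbb{N}})}\int\log\left(\frac{d(\overleftarrow{S}_{0,n}\otimes\overrightarrow{R}_{0,n})}{d(\overleftarrow{P}_{0,n}\otimes\nu_{0,n})}\right)d(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})\qquad\text{(IV.7)}$$

where the supremum in (IV.7) is achieved when the RND satisfies

$$\Lambda_{0,n}(x^n,y^n)=\frac{d(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})}{d(\overleftarrow{S}_{0,n}\otimes\overrightarrow{R}_{0,n})}(x^n,y^n)=1\ \text{- a.s.},\quad n\in\mathbb{N}.\qquad\text{(IV.8)}$$

Equivalently,

$$\lambda_i(x^i,y^i)=\frac{p_i(dx_i;x^{i-1},y^{i-1})\otimes q_i(dy_i;y^{i-1},x^i)}{s_i(dy_i;y^{i-1},x^{i-1})\otimes r_i(dx_i;x^{i-1},y^i)}=1\ \text{- a.s.},\quad i=0,1,\ldots,n.\qquad\text{(IV.9)}$$

Moreover, if $q_i(\cdot;y^{i-1},x^i)\ll s_i(\cdot;y^{i-1},x^{i-1})$ and $p_i(\cdot;x^{i-1},y^{i-1})\ll r_i(\cdot;x^{i-1},y^i)$, $i=0,1,\ldots,n$, then

$$\Lambda_{0,n}(x^n,y^n)=\otimes_{i=0}^{n}\frac{q_i(dy_i;y^{i-1},x^i)}{s_i(dy_i;y^{i-1},x^{i-1})}\,\frac{p_i(dx_i;x^{i-1},y^{i-1})}{r_i(dx_i;x^{i-1},y^i)}=1\ \text{- a.s.}\qquad\text{(IV.10)}$$

or equivalently,

$$\frac{q_i(dy_i;y^{i-1},x^i)}{s_i(dy_i;y^{i-1},x^{i-1})}\,\frac{p_i(dx_i;x^{i-1},y^{i-1})}{r_i(dx_i;x^{i-1},y^i)}=1\ \text{- a.s.},\quad i=0,1,\ldots,n.\qquad\text{(IV.11)}$$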
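Proof. Part A. From Theorem III.1:

$$\mathbb{D}(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n}\|\overleftarrow{P}_{0,n}\otimes\bar{\nu}_{0,n})=\int\log\left(\frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{\bar{\nu}_{0,n}(dy^n)}\right)(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})(dx^n,dy^n)\qquad\text{(IV.12)}$$

$$=\int\log\left(\frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{\nu_{0,n}(dy^n)}\right)(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})(dx^n,dy^n)+\mathbb{D}(\nu_{0,n}\|\bar{\nu}_{0,n})\qquad\text{(IV.13)}$$

$$\ge\int\log\left(\frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{\nu_{0,n}(dy^n)}\right)(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})(dx^n,dy^n).\qquad\text{(IV.14)}$$

Moreover, equality holds in (IV.14) when $\bar{\nu}_{0,n}=\nu_{0,n}$. Hence, $\mathbb{D}(P_{0,n}\|\overrightarrow{\Pi}_{0,n})$ in (III.19) can be expressed via variational equalities (IV.5) and (IV.6).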

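Part B. Consider the difference between $I(X^n\rightarrow Y^n)=\mathbb{D}(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n}\|\overrightarrow{\Pi}_{0,n})$ given by (III.19) and the right hand side of (IV.7) (without the supremum). Then

$$\mathbb{I}_{X^n\rightarrow Y^n}(\overleftarrow{P}_{0,n},\overrightarrow{Q}_{0,n})-\int\log\left(\frac{d(\overleftarrow{S}_{0,n}\otimes\overrightarrow{R}_{0,n})}{d(\overleftarrow{P}_{0,n}\otimes\nu_{0,n})}\right)d(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})=\int\log\left(\frac{d(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})}{d(\overleftarrow{S}_{0,n}\otimes\overrightarrow{R}_{0,n})}\right)d(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})$$

$$\ge\int\left(1-\frac{d(\overleftarrow{S}_{0,n}\otimes\overrightarrow{R}_{0,n})}{d(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})}\right)d(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})=0\qquad\text{(IV.15)}$$

where (IV.15) follows from the inequality $\log x\ge 1-\frac{1}{x}$, $x>0$, which holds with equality if and only if $x=1$. Furthermore, equality holds in (IV.15) when the RND $\Lambda_{0,n}(x^n,y^n)=\frac{d(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})}{d(\overleftarrow{S}_{0,n}\otimes\overrightarrow{R}_{0,n})}(x^n,y^n)=1$, $\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n}$-a.s. in $(x^n,y^n)$. Since $(\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n})(\mathcal{X}_{0,n}\times\mathcal{Y}_{0,n})=(\overleftarrow{S}_{0,n}\otimes\overrightarrow{R}_{0,n})(\mathcal{X}_{0,n}\times\mathcal{Y}_{0,n})=1$, this condition is equivalent to $\overleftarrow{P}_{0,n}\otimes\overrightarrow{Q}_{0,n}=\overleftarrow{S}_{0,n}\otimes\overrightarrow{R}_{0,n}$. By conditioning (IV.8) on $\mathcal{B}(\mathcal{X}_{0,n-1})\odot\mathcal{B}(\mathcal{Y}_{0,n-1})$ one obtains (IV.9). Furthermore, (IV.10) is obtained from (IV.8), while (IV.11) is obtained by conditioning. □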
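Both parts of Theorem IV.1 can be checked numerically on finite alphabets. The following Python sketch is an illustration under our own assumptions: binary alphabets, horizon $n=1$, and randomly generated kernels `p0`, `p1`, `q0`, `q1` (all hypothetical). It computes directed information from its definition as a sum of conditional mutual informations, and compares it with the objective of (IV.6) at random output measures and with the objective of (IV.7) at random joint distributions. On finite alphabets any strictly positive joint distribution factorizes in the order $y_0,x_0,y_1,x_1$, so it can play the role of the product of the two arbitrary causally conditioned measures in Part B.

```python
import itertools
import math
import random

rng = random.Random(1)
B = (0, 1)  # binary alphabets for X_i and Y_i; horizon n = 1 (times i = 0, 1)

def rand_dist(k):
    """A strictly positive probability vector of length k (illustrative helper)."""
    w = [rng.random() + 0.05 for _ in range(k)]
    total = sum(w)
    return [wi / total for wi in w]

# Hypothetical kernels p_i(x_i; x^{i-1}, y^{i-1}) and q_i(y_i; y^{i-1}, x^i).
p0 = rand_dist(2)
q0 = {x0: rand_dist(2) for x0 in B}
p1 = {(x0, y0): rand_dist(2) for x0 in B for y0 in B}
q1 = {(y0, x0, x1): rand_dist(2) for y0 in B for x0 in B for x1 in B}

def feedback(x0, x1, y0, y1):
    """Causally conditioned input distribution (product of the p_i)."""
    return p0[x0] * p1[(x0, y0)][x1]

def forward(x0, x1, y0, y1):
    """Causally conditioned output distribution (product of the q_i)."""
    return q0[x0][y0] * q1[(y0, x0, x1)][y1]

def joint(x0, x1, y0, y1):
    return feedback(x0, x1, y0, y1) * forward(x0, x1, y0, y1)

points = list(itertools.product(B, repeat=4))  # tuples (x0, x1, y0, y1)

def marginal(keep):
    """Marginal of the joint distribution on the coordinates listed in keep."""
    out = {}
    for pt in points:
        key = tuple(pt[i] for i in keep)
        out[key] = out.get(key, 0.0) + joint(*pt)
    return out

nu = marginal((2, 3))  # output marginal on (y0, y1)

def objective_a(nu_bar):
    """Integral in (IV.6) for a candidate output measure nu_bar."""
    return sum(joint(*pt) * math.log(forward(*pt) / nu_bar[pt[2:]]) for pt in points)

def objective_b(sr):
    """Integral in (IV.7) for a candidate joint pmf sr in the role of S (x) R."""
    return sum(joint(*pt) * math.log(sr[pt] / (feedback(*pt) * nu[pt[2:]])) for pt in points)

# Directed information as I(X_0; Y_0) + I(X^1; Y_1 | Y_0), computed from marginals.
m_x0y0, m_x0, m_y0 = marginal((0, 2)), marginal((0,)), marginal((2,))
m_xxy0 = marginal((0, 1, 2))
term0 = sum(v * math.log(v / (m_x0[k[:1]] * m_y0[k[1:]])) for k, v in m_x0y0.items())
term1 = sum(joint(*pt) * math.log(joint(*pt) * m_y0[pt[2:3]] / (m_xxy0[pt[:3]] * nu[pt[2:]]))
            for pt in points)
directed_info = term0 + term1

def rand_pmf(keys):
    return dict(zip(keys, rand_dist(len(keys))))

gaps_a = [objective_a(rand_pmf(list(nu))) - directed_info for _ in range(100)]
gaps_b = [directed_info - objective_b(rand_pmf(points)) for _ in range(100)]
joint_pmf = {pt: joint(*pt) for pt in points}

print(directed_info, min(gaps_a), min(gaps_b))
```

In this sketch the gap in Part A equals the relative entropy between the true output marginal and the candidate, and the gap in Part B equals the relative entropy between the true joint distribution and the candidate, which is why both are nonnegative and vanish at the optimizers.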
Next, we discuss the relation between the variational equalities of directed information given by (IV.7) and the variational equalities of mutual information given by (IV.2). Clearly, (IV.2) is also equivalent to

$$\sup_{P_{X^n|Y^n}\in\mathcal{Q}(\mathcal{X}_{0,n};\mathcal{Y}_{0,n})}\int\log\left(\frac{d(P_{X^n|Y^n}\otimes P_{Y^n})}{d(P_{X^n}\times P_{Y^n})}\right)P_{Y^n|X^n}(dy^n|x^n)\otimes P_{X^n}(dx^n)\qquad\text{(IV.16)}$$

since the RND in (IV.16) is another version of the one in (IV.2). Thus, (IV.7) is understood in the context of (IV.16), in which the directed information function is utilized together with the decomposition $\overleftarrow{S}_{0,n}\otimes\overrightarrow{R}_{0,n}$. Consider obtaining the analog of the maximizing measure in (IV.2). Suppose $q_i(\cdot;y^{i-1},x^i)\ll s_i(\cdot;y^{i-1},x^{i-1})$, $i=0,1,\ldots,n$. Then from (IV.9) we obtain

$$r_i(dx_i;x^{i-1},y^i)=\left(\frac{q_i(dy_i;y^{i-1},x^i)}{s_i(dy_i;y^{i-1},x^{i-1})}\right)\otimes p_i(dx_i;x^{i-1},y^{i-1}),\quad i=0,1,\ldots,n$$

$$=\frac{q_i(dy_i;y^{i-1},x^i)\otimes p_i(dx_i;x^{i-1},y^{i-1})}{\int_{\mathcal{X}_i}q_i(dy_i;y^{i-1},x^i)\otimes p_i(dx_i;x^{i-1},y^{i-1})},\quad i=0,1,\ldots,n.$$

Consequently,

$$\overrightarrow{R}_{0,n}(dx^n|y^n)=\otimes_{i=0}^{n}\frac{q_i(dy_i;y^{i-1},x^i)\otimes p_i(dx_i;x^{i-1},y^{i-1})}{\int_{\mathcal{X}_i}q_i(dy_i;y^{i-1},x^i)\otimes p_i(dx_i;x^{i-1},y^{i-1})}.$$
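The previous expression is the analog of the maximizing distribution $P_{X^n|Y^n}$ in (IV.2). Finally, we note that the optimization in (IV.7) can be done by keeping $\overrightarrow{\Pi}_{0,n}$ fixed, generated by $\mathbf{P}(\cdot|\cdot)\in\mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}};\mathcal{Y}^{\mathbb{N}})$ and $\mathbf{Q}(\cdot|\cdot)\in\mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}};\mathcal{X}^{\mathbb{N}})$.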

One can obtain the variational equalities (IV.5), (IV.6) of Theorem IV.1 by using the equivalent definition consisting of a family of functions $\{p_n(\cdot;\cdot,\cdot)\in\mathcal{Q}(\mathcal{X}_n;\mathcal{X}_{0,n-1}\times\mathcal{Y}_{0,n-1}):n\in\mathbb{N}\}$ and $\{q_n(\cdot;\cdot,\cdot)\in\mathcal{Q}(\mathcal{Y}_n;\mathcal{Y}_{0,n-1}\times\mathcal{X}_{0,n}):n\in\mathbb{N}\}$, and the alternative directed information expression (III.16). These alternative formulations are described in the following remark.

Remark IV.2. 

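Part A. For any arbitrary stochastic kernel $\bar{\nu}_{i|i-1}(dy_i;y^{i-1})\in\mathcal{Q}(\mathcal{Y}_i;\mathcal{Y}_{0,i-1})$, $i=0,1,\ldots,n$, then (IV.5) can be written as

$$I(X^n\rightarrow Y^n)=\mathbb{I}_{X^n\rightarrow Y^n}(p_i(\cdot;\cdot,\cdot),q_i(\cdot;\cdot,\cdot):i=0,1,\ldots,n)$$

$$=\inf_{\bar{\nu}_{i|i-1}\in\mathcal{Q}(\mathcal{Y}_i;\mathcal{Y}_{0,i-1}):\ i=0,1,\ldots,n}\ \sum_{i=0}^{n}\int\mathbb{D}\big(q_i(\cdot;y^{i-1},x^i)\|\bar{\nu}_{i|i-1}(\cdot;y^{i-1})\big)\,p_i(dx_i;x^{i-1},y^{i-1})\otimes P_{0,i-1}(dx^{i-1},dy^{i-1})\qquad\text{(IV.17)}$$

where the infimum is achieved by $\bar{\nu}^{*}_{i|i-1}(dy_i;y^{i-1})=\nu_{i|i-1}(dy_i;y^{i-1})$, $i=0,1,\ldots,n$. The derivation is similar to the one of Theorem IV.1, Part A., but it is done with respect to component $\bar{\nu}_{i|i-1}$, $i=0,1,\ldots,n$.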
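Part B. For any arbitrary family of stochastic kernels $\{r_i(\cdot;\cdot,\cdot)\in\mathcal{Q}(\mathcal{X}_i;\mathcal{X}_{0,i-1}\times\mathcal{Y}_{0,i}):i=0,1,\ldots,n\}$ and $\{s_i(\cdot;\cdot,\cdot)\in\mathcal{Q}(\mathcal{Y}_i;\mathcal{Y}_{0,i-1}\times\mathcal{X}_{0,i-1}):i=0,1,\ldots,n\}$, define

$$\mathbb{I}(p_i,q_i,s_i\otimes r_i:i=0,1,\ldots,n)\triangleq\sum_{i=0}^{n}\int_{\mathcal{X}_{0,i}\times\mathcal{Y}_{0,i}}\log\left(\frac{s_i(dy_i;y^{i-1},x^{i-1})\otimes r_i(dx_i;x^{i-1},y^i)}{p_i(dx_i;x^{i-1},y^{i-1})\otimes\nu_{i|i-1}(dy_i;y^{i-1})}\right)\otimes_{k=0}^{i}p_k(dx_k;x^{k-1},y^{k-1})\otimes q_k(dy_k;y^{k-1},x^k).$$

Then (IV.7) can be written as

$$I(X^n\rightarrow Y^n)=\mathbb{I}_{X^n\rightarrow Y^n}(p_i(\cdot;\cdot,\cdot),q_i(\cdot;\cdot,\cdot):i=0,1,\ldots,n)$$

$$=\sup_{\substack{s_i(\cdot;y^{i-1},x^{i-1})\otimes r_i(\cdot;x^{i-1},y^i):\ i=0,1,\ldots,n\\ s_i(\cdot;\cdot,\cdot)\in\mathcal{Q}(\mathcal{Y}_i;\mathcal{Y}_{0,i-1}\times\mathcal{X}_{0,i-1}),\ r_i(\cdot;\cdot,\cdot)\in\mathcal{Q}(\mathcal{X}_i;\mathcal{X}_{0,i-1}\times\mathcal{Y}_{0,i})}}\mathbb{I}(p_i,q_i,s_i\otimes r_i:i=0,1,\ldots,n)\qquad\text{(IV.18)}$$

where the supremum is achieved by (IV.9) and (IV.11). The derivation is similar to the one of Theorem IV.1, Part B., but it is done with respect to component $s_i\otimes r_i$, $i=0,1,\ldots,n$.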

V. Conclusion 

In this paper we derive functional and topological properties of directed information defined 
on abstract Polish spaces. We establish convexity of the sets of (nonanticipative) causally conditioned convolutional distributions used to define directed information, and we show convexity and concavity of directed information with respect to these distributions. We provide a novel
theorem on weak compactness of causally conditioned convolutional distributions, and weak 
convergence of joint distributions and marginal distributions associated with directed information. 
We use these results to show lower semicontinuity of directed information as a functional of 
two causally conditioned convolutional distributions, and under certain conditions continuity of 
directed information with respect to the causally conditioned convolutional input distributions. 
We also establish two variational equalities for directed information. These functional and 
topological properties may be viewed as generalizations of analogous functional and topological 
properties of mutual information. The results of this paper can be used to address extremum problems of the directed information functional for point-to-point communications and for network communications.

VI. Appendix A

A. Background Material

In this section, we introduce the basic analytical concepts which are used throughout the paper. 

Weak Convergence and Compactness. 

The main notions discussed are weak convergence of probability measures, the relation to 
convergence with respect to Prohorov metric, tightness of a family of probability measures, 
relative compactness, weak compactness, and compactness. 

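Let $(\mathcal{X},d)$ be a metric space, $\mathcal{B}(\mathcal{X})$ the $\sigma$-algebra of Borel subsets of $\mathcal{X}$, and $\mathcal{M}_1(\mathcal{X})$ the family of probability measures on $\mathcal{X}$. Let $BC(\mathcal{X})$ denote the set of bounded, continuous real-valued functions $f$ on $(\mathcal{X},d)$ endowed with the supremum norm $\|f\|=\sup_{x\in\mathcal{X}}|f(x)|$. From [32] it is known that each $P\in\mathcal{M}_1(\mathcal{X})$ is uniquely determined by the integrals $\{\int_{\mathcal{X}}f(x)dP(x):f\in BC(\mathcal{X})\}$. A sequence $\{P_n:n\ge 1\}\subset\mathcal{M}_1(\mathcal{X})$ is said to converge weakly to $P\in\mathcal{M}_1(\mathcal{X})$ if

$$\lim_{n\rightarrow\infty}\int_{\mathcal{X}}f(x)dP_n(x)=\int_{\mathcal{X}}f(x)dP(x),\quad\forall f\in BC(\mathcal{X}).$$

Weak convergence of $\{P_n:n\ge 1\}$ to $P$ is denoted by $P_n\xrightarrow{w}P$.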

The space $\mathcal{M}_1(\mathcal{X})$ can be made into a Hausdorff topological space by taking as basic neighbourhoods of $P\in\mathcal{M}_1(\mathcal{X})$ all sets of the form

$$\left\{Q\in\mathcal{M}_1(\mathcal{X}):\left|\int_{\mathcal{X}}f_i(x)dQ(x)-\int_{\mathcal{X}}f_i(x)dP(x)\right|<\epsilon,\ i=1,\ldots,k\right\}$$

where $\epsilon>0$ and $f_1,f_2,\ldots,f_k\in BC(\mathcal{X})$. The resulting topology is called the topology of weak convergence or weak topology. Hence, a sequence $\{P_n:n\ge 1\}$ converges to $P$ in this topology if and only if $P_n\xrightarrow{w}P$. The space $\mathcal{M}_1(\mathcal{X})$ is metrizable with respect to the Prohorov metric [40], denoted by $\mathcal{L}(\cdot,\cdot)$, and with respect to this metric $\mathcal{M}_1(\mathcal{X})$ is a Hausdorff topological space. The following theorem gives fundamental results regarding $(\mathcal{M}_1(\mathcal{X}),\mathcal{L})$.

Theorem VI.1. [.;, pp. 43-46]
Let $(\mathcal{X},d)$ be a metric space.
1) If $\mathcal{X}$ is separable, then $\mathcal{M}_1(\mathcal{X})$ is separable.
2) If $(\mathcal{X},d)$ is complete and separable, then $(\mathcal{M}_1(\mathcal{X}),\mathcal{L})$ is a complete separable metric space (with respect to the Prohorov metric $\mathcal{L}$).

3) If $(\mathcal{X},d)$ is a compact metric space, then $(\mathcal{M}_1(\mathcal{X}),\mathcal{L})$ is a compact metric space.

Thus, separability is a topological property while completeness is a property of the metric. Statement 3) of Theorem VI.1 (due to Prohorov) states that if $(\mathcal{X},d)$ is a compact metric space, then any sequence $\{P_n:n\ge 1\}$ of probability measures on $\mathcal{X}$ possesses a convergent subsequence (with respect to the Prohorov metric).

A crucial result for the characterization of compact subsets of Aii{X) is the next theorem due 
to Prohorov, which relates compactness and tightness of a set of measures. 

Definition VI.2. (Tight Measures) [40, p. 308]

A probability measure $P\in\mathcal{M}_1(\mathcal{X})$ is said to be tight if for every $\epsilon>0$ there exists a compact set $K\subset\mathcal{X}$ such that $P(K)>1-\epsilon$, or equivalently $P(K^c)<\epsilon$. Moreover, a family of probability measures $M\subset\mathcal{M}_1(\mathcal{X})$ is said to be tight or uniformly tight if for every $\epsilon>0$ there exists a compact set $K\subset\mathcal{X}$ such that $\inf_{P\in M}P(K)>1-\epsilon$.

Thus, $M=\{P_n:n\ge 1\}$ is uniformly tight if for every $\epsilon>0$ there exists a compact set $K\subset\mathcal{X}$ such that $P_n(K)>1-\epsilon$, $n=1,2,\ldots$. Relative and weak compactness are defined below.

Definition VI.3. (Relative and Weak Compactness) 

1) [32, Ch. 1, p. 57] A family of probability measures $M\subset\mathcal{M}_1(\mathcal{X})$ is said to be relatively compact (with respect to the Prohorov metric) if each sequence $\{P_n:n\ge 1\}$ of elements in $M$ contains some subsequence $\{P_{n_i}:i\in\{1,2,\ldots\}\}$ converging (as $i\rightarrow\infty$) to some probability measure $P\in\mathcal{M}_1(\mathcal{X})$ with respect to the Prohorov metric $\mathcal{L}(\cdot,\cdot)$. Here, the limit $P$ is not required to belong to $M$; all that is required is that it belongs to $\mathcal{M}_1(\mathcal{X})$.

2) [32, Ch. 1, Corollary, p. 59] A family of probability measures $M\subset\mathcal{M}_1(\mathcal{X})$ is said to be weakly compact or relatively compact (with respect to weak convergence) if 1) holds with convergence tested with respect to weak convergence.

The following theorem, due to Prohorov, relates weak convergence and tightness of a family of probability measures.

Theorem VI.4. (Prohorov's Theorem) [40, Theorem A.3.15, p. 309]

Let $(\mathcal{X},d)$ be a complete and separable metric space and let $M\subset\mathcal{M}_1(\mathcal{X})$ be a family of probability measures. Then, the family of probability measures $M$ is relatively compact with respect to weak convergence if and only if it is tight.

Note that if $(\mathcal{X},d)$ is an arbitrary metric space and $M\subset\mathcal{M}_1(\mathcal{X})$ a family of probability measures, then tightness of $M$ implies relative compactness of $M$, but the reverse is not valid. However, when $(\mathcal{X},d)$ is a complete separable metric space, relative compactness and weak compactness are equivalent notions. Therefore, a family of probability measures $M\subset\mathcal{M}_1(\mathcal{X})$ on a complete separable metric space $(\mathcal{X},d)$ is weakly compact or relatively compact with respect to weak convergence if and only if it is tight. Moreover, if $P_n\xrightarrow{w}P$, then the family $\{P_n:n\ge 1\}$ is tight.
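The notion of tightness can be made concrete with a small numerical experiment. The following Python sketch is an illustration under our own assumptions: the two Gaussian families (unit variance, means in $[-1,1]$ versus means $1,2,\ldots,50$) and the compact set $[-6,6]$ are hypothetical choices. For the family with bounded means a single compact interval captures almost all the mass uniformly, whereas for the family with drifting means the same interval fails, and any fixed interval would fail once the mean is large enough.

```python
import math

def std_normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mass_outside(mean, k):
    """Mass that N(mean, 1) assigns to the complement of the compact set [-k, k]."""
    return 1.0 - (std_normal_cdf(k - mean) - std_normal_cdf(-k - mean))

# Hypothetical families of probability measures on the real line.
bounded_means = [m / 10.0 for m in range(-10, 11)]   # means in [-1, 1]
drifting_means = [float(n) for n in range(1, 51)]    # means 1, 2, ..., 50

# Worst-case mass escaping the compact set K = [-6, 6] over each family.
worst_bounded = max(mass_outside(m, 6.0) for m in bounded_means)
worst_drifting = max(mass_outside(m, 6.0) for m in drifting_means)

print(worst_bounded, worst_drifting)
```

The first family is tight and hence, by Prohorov's theorem, relatively compact with respect to weak convergence; the sequence with drifting means admits no weakly convergent subsequence, since its mass escapes to infinity.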

Finally, we give another version due to Prohorov for a set of measures M C M.i{X) to be 
compact. 

Theorem VI.5. (Corollary of Prohorov's Theorem)

Let $(\mathcal{X},d)$ be a separable metric space and $M\subset\mathcal{M}_1(\mathcal{X})$ a set of measures. The following hold.

(a) If $M$ is closed and tight, then $M$ is compact.

(b) Suppose $\mathcal{X}$ is complete. If $M$ is compact then $M$ is closed and tight.

That is, for $(\mathcal{X},d)$ a separable metric space a sufficient condition for compactness of $M\subset\mathcal{M}_1(\mathcal{X})$ is that $M$ is closed and tight. If $\mathcal{X}$ is also complete, then this condition is also necessary.

Lebesgue's Dominated Convergence Theorems (LDCT). 

In this paper we often need to establish convergence of a sequence of integrals. Sufficient 
conditions are given by LDCT, which allow one to interchange the limit and the integral. We 
state these Lebesgue's theorems below. 

Theorem VI.6. (Lebesgue's Dominated Convergence Theorem (LDCT)) [31, Theorem 3, p. 187]
Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space and $Y,X,X_1,X_2,\ldots$ be RV's such that $|X_n|\le Y$ for all $n\ge 1$, $\mathbb{E}Y<\infty$ and $X_n\rightarrow X$ almost surely.
Then $\mathbb{E}|X|<\infty$,

a) $\lim_{n\rightarrow\infty}\mathbb{E}X_n=\mathbb{E}X$;
b) $\lim_{n\rightarrow\infty}\mathbb{E}|X_n-X|=0$.
The previous theorem can be generalized as follows.
Corollary VI.7. [31, Corollary, p. 188] (LDCT in $L^p$-Spaces)

Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space and $Y,X,X_1,X_2,\ldots$ be RV's such that $|X_n|\le Y$, $X_n\rightarrow X$ almost surely, and $\mathbb{E}Y^p<\infty$ for some $p>0$. Then $\mathbb{E}|X|^p<\infty$ and $\lim_{n\rightarrow\infty}\mathbb{E}|X-X_n|^p=0$.

Theorem VI.8. (Generalized LDCT) [42, Exercise 20, p. 59]

Let $\{f_n:n\ge 1\}$, $\{g_n:n\ge 1\}$, $f,g\in L^1$, $f_n\rightarrow f$ and $g_n\rightarrow g$ almost everywhere, $|f_n|\le g_n$ for all $n\ge 1$, and $\lim_{n\rightarrow\infty}\int g_n d\mathbb{P}=\int g\,d\mathbb{P}$. Then $\lim_{n\rightarrow\infty}\int f_n d\mathbb{P}=\int f\,d\mathbb{P}$.

Uniform Integrability. 

In this paper we shall also need stronger sufficient conditions to verify convergence of a sequence 
of integrals using the concept of uniform integrability. We state this next. 

Definition VI.9. (Uniform Integrability of RV's) [31, Definition 4, p. 188]

Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space. A sequence of RV's $\{X_n:n\ge 1\}$ is said to be uniformly $\mathbb{P}$-integrable if

$$\lim_{c\rightarrow\infty}\sup_{n\ge 1}\int_{\{|X_n|>c\}}|X_n|\,d\mathbb{P}=0.$$

Equivalently,

$$\lim_{c\rightarrow\infty}\sup_{n\ge 1}\mathbb{E}\big\{|X_n|I_{\{|X_n|>c\}}\big\}=0.$$

Note that if $\{X_n:n\ge 1\}$ satisfy $|X_n|\le Y$ and $\mathbb{E}\{Y\}<\infty$, then the sequence $\{X_n:n\ge 1\}$ is uniformly integrable. Also, for the space $L^p(\Omega,\mathcal{F},\mathbb{P})$, $p>1$ (the space of RV's with finite absolute $p$-th moments, $\mathbb{E}\{|X|^p\}<\infty$), if $\mathcal{L}\subset L^p(\Omega,\mathcal{F},\mathbb{P})$ is bounded for some $p>1$ then $\mathcal{L}$ is uniformly integrable.

The following theorem gives some properties for a family of uniformly integrable RV's.

Theorem VI.10. (Uniform Integrability of RV's) [31, Theorem 4, pp. 188-189]

Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space and $\{X_n:n\ge 1\}$ a uniformly $\mathbb{P}$-integrable family of RV's.

Then

(a) $\mathbb{E}\liminf_n X_n\le\liminf_n\mathbb{E}X_n\le\limsup_n\mathbb{E}X_n\le\mathbb{E}\limsup_n X_n$.

(b) If in addition $X_n\rightarrow X$ almost surely, then $\mathbb{E}|X|<\infty$, $\lim_{n\rightarrow\infty}\mathbb{E}|X_n|=\mathbb{E}|X|$ and $\lim_{n\rightarrow\infty}\mathbb{E}|X_n-X|=0$.

Definition VI.9 describes uniform integrability when the integrand is a sequence and the probability measure is fixed. In the next definition of uniform integrability the integrand is fixed but the probability measure is a sequence.

Definition VI.11. (Uniform Integrability for a family of probability measures) [36, Appendix, Definition A.2, p. 3084]

Let $M\subset\mathcal{M}_1(\mathcal{X})$ be a family of probability measures on $(\mathcal{X},\mathcal{B}(\mathcal{X}))$. A measurable function $f$ is said to be uniformly integrable over $M$ if

$$\lim_{c\rightarrow\infty}\sup_{P\in M}\int_{\{x\in\mathcal{X}:|f(x)|>c\}}|f(x)|\,dP(x)=0.$$

A sufficient condition for the convergence of a sequence of integrals of a function with respect 
to a weakly convergent sequence of measures is the following. 

Theorem VI.12. [36, Appendix, Theorem A.2, p. 3084]

Let $M\subset\mathcal{M}_1(\mathcal{X})$ be a closed family of probability measures on $(\mathcal{X},\mathcal{B}(\mathcal{X}))$, and let $\{P_n:n\ge 1\}\subset M$ be a weakly convergent sequence in $M$, $P_n\xrightarrow{w}P$. If $f$ is a continuous function and uniformly integrable over $\{P_n:n\ge 1\}$ then $\lim_{n\rightarrow\infty}\int f\,dP_n=\int f\,dP$.
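The need for uniform integrability when the integrand is unbounded can be seen from a standard counterexample. The following Python sketch is an illustration under our own assumptions: the measures $P_n=(1-1/n)\delta_0+(1/n)\delta_n$ and the two integrands are hypothetical choices. The sequence converges weakly to the point mass at 0, integrals of a bounded continuous function converge to the integral under the limit, but integrals of the unbounded function $f(x)=x$ stay equal to 1, and the tail integrals of $f$ do not vanish uniformly in $n$.

```python
def measure(n):
    """Hypothetical discrete measure P_n = (1 - 1/n) delta_0 + (1/n) delta_n as (atom, weight) pairs."""
    return [(0.0, 1.0 - 1.0 / n), (float(n), 1.0 / n)]

limit_measure = [(0.0, 1.0)]  # weak limit: point mass at 0

def identity(x):
    """Unbounded continuous integrand f(x) = x."""
    return x

def bounded(x):
    """Bounded continuous integrand g(x) = min(|x|, 1)."""
    return min(abs(x), 1.0)

def integrate(f, mu):
    """Integral of f with respect to a discrete measure."""
    return sum(w * f(x) for x, w in mu)

def tail(f, mu, c):
    """Integral of |f| over the set where |f| exceeds c."""
    return sum(w * abs(f(x)) for x, w in mu if abs(f(x)) > c)

ns = [10, 100, 1000, 10000]
bounded_integrals = [integrate(bounded, measure(n)) for n in ns]
identity_integrals = [integrate(identity, measure(n)) for n in ns]
worst_tail = max(tail(identity, measure(n), 500.0) for n in ns)

print(bounded_integrals, identity_integrals, worst_tail)
```

Here $f(x)=x$ is continuous but not uniformly integrable over the sequence, so the conclusion of Theorem VI.12 fails for it, while it holds for the bounded integrand.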

Regular Conditional Probability Measures and their Properties. 

In this paper we extensively utilize conditional distributions and expectations. Hence we shall need their precise definitions, given in terms of the Radon-Nikodym Derivative.
For any two measurable spaces $(E,\mathcal{E})$ and $(F,\mathcal{F})$, denote by $\mathcal{E}\odot\mathcal{F}$ the $\sigma$-algebra on $E\times F$ generated by the collection of all measurable rectangles $A\times B=\{(x,y):x\in A,y\in B\}$, $A\in\mathcal{E}$, $B\in\mathcal{F}$, called the product $\sigma$-algebra; the measurable space $(E\times F,\mathcal{E}\odot\mathcal{F})$ is called the product space of $(E,\mathcal{E})$ and $(F,\mathcal{F})$.

Let $(\Omega,\mathcal{F})$ be a measurable space. Given two probability measures $P,Q$ on $(\Omega,\mathcal{F})$, $Q$ is said to be absolutely continuous with respect to $P$ (denoted $Q\ll P$) if for every $A\in\mathcal{F}$ such that $P(A)=0$ then $Q(A)=0$. If $Q\ll P$, by the Radon-Nikodym theorem, there exists a $P$-integrable and $\mathcal{F}$-measurable function $f$ such that for every $A\in\mathcal{F}$, $Q(A)=\int_A f(\omega)dP(\omega)$. The function $f$ is unique $P$-a.e. (almost everywhere) and is called the Radon-Nikodym Derivative (RND) density of $Q$ with respect to $P$, denoted by $f(\omega)=\frac{dQ}{dP}(\omega)$. That is, if $\tilde{f}$ is another function satisfying these properties, there exists a $P$-null set $N$ such that $f(\omega)=\tilde{f}(\omega)$, $\forall\omega\in N^c=\Omega\setminus N$. This relation is denoted by $f(\omega)=\tilde{f}(\omega)$, $P$-a.s., and we say that the RND $f$ is unique $P$-a.s. If in addition to $Q\ll P$, also $P\ll Q$ holds, then $P$ and $Q$ are called equivalent, denoted by $Q\sim P$.

Next, we state the definition of a regular conditional probability measure, from which the regular 
conditional distribution of one RV given another RV can be defined. 

Definition VI.13. (Regular Conditional Probability Measure) [40, A.4, pp. 312-313]
Let $(\Omega,\mathcal{F},P)$ be a probability space and $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. A regular conditional probability measure $P(\cdot|\mathcal{G})(\cdot)$ on $(\Omega,\mathcal{F})$ is a function $P(A|\mathcal{G})(\omega)$, $A\in\mathcal{F}$, $\omega\in\Omega$ having the following properties.

(a) For each $A\in\mathcal{F}$, the function mapping $\omega\in\Omega\mapsto P(A|\mathcal{G})(\omega)$ is measurable with respect to $\mathcal{G}$.

(b) For each $\omega\in\Omega$, $P(\cdot|\mathcal{G})(\omega)$ is a probability measure on $\mathcal{F}$.

(c) For each $A\in\mathcal{F}$, $P(A|\mathcal{G})(\omega)$ is a version of the conditional probability of $A$ given $\mathcal{G}$.
Moreover,

$$P(A\cap B)=\int_B P(A|\mathcal{G})(\omega)P_{\mathcal{G}}(d\omega),\quad\forall B\in\mathcal{G}$$

where $P_{\mathcal{G}}$ is the restriction of $P$ to $\mathcal{G}$.

Statements (a) and (c) state that $P(A|\mathcal{G})(\omega)$ is a version of the conditional probability of $A$ given $\mathcal{G}$ (and it is a function of $\omega$). If such a version $P(\cdot|\mathcal{G})(\cdot)$ exists then it is unique in the sense that, if $\tilde{P}(\cdot|\mathcal{G})(\cdot)$ is another function with these properties, then there exists a $P_{\mathcal{G}}$-null set $N$ such that $P(A|\mathcal{G})(\omega)=\tilde{P}(A|\mathcal{G})(\omega)$, $\forall A\in\mathcal{F}$ and $\omega\in N^c$ (i.e., $P(\cdot|\mathcal{G})(\omega)=\tilde{P}(\cdot|\mathcal{G})(\omega)$, $P_{\mathcal{G}}$-a.s.). Thus, a regular conditional probability measure exists if it can be shown that a version of the conditional probability measure can be chosen to be a probability measure on $\mathcal{F}$ for each $\omega\in\Omega$. Although in general a regular conditional probability measure may not exist, for the case when $\mathcal{G}$ is generated by a countable partition of $\Omega$, a regular conditional probability measure given $\mathcal{G}$ always exists. Moreover, if $(\Omega,d)$ is a metric space which is complete and separable (Polish space), and $\mathcal{F}$ is the Borel $\sigma$-algebra, then for any probability measure $P$ on $(\Omega,\mathcal{F})$ and any sub-$\sigma$-algebra $\mathcal{G}\subset\mathcal{F}$, a regular conditional probability measure of $P$ given $\mathcal{G}$ always exists.

The next lemma summarizes certain relationships between the absolute continuity of probability measures.

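Lemma VI.14. (Absolute Continuity of Probability Measures) [43, Lemma 4.4.7, pp. 149-150]

a) Suppose $Q_{\mathcal{G}}\ll P_{\mathcal{G}}$. If $Q(\cdot|\mathcal{G})(\omega)\ll P(\cdot|\mathcal{G})(\omega)$, $Q_{\mathcal{G}}$-a.s., then $Q\ll P$.

b) Conversely, if $Q\ll P$, then $Q(\cdot|\mathcal{G})(\omega)\ll P(\cdot|\mathcal{G})(\omega)$, $Q_{\mathcal{G}}$-a.s.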

Note that if $Y:(\Omega,\mathcal{F})\mapsto(\mathcal{Y},\mathcal{A})$ is a RV on $(\Omega,\mathcal{F})$ into a measurable space $(\mathcal{Y},\mathcal{A})$ and $\mathcal{Y}$ is a Polish space, then a regular conditional distribution for $Y$ given the sub-$\sigma$-algebra $\mathcal{G}$ of $\mathcal{F}$, denoted by $P(dy|\mathcal{G})(\omega)$, is defined according to Definition VI.13, and this always exists.
Additionally, if $X:(\Omega,\mathcal{F})\mapsto(\mathcal{X},\mathcal{B})$ is a RV on $(\Omega,\mathcal{F})$ into a measurable space $(\mathcal{X},\mathcal{B})$, and $\mathcal{G}$ is the sub-$\sigma$-algebra of $\mathcal{F}$ generated by $X$, then $P(dy|X)(\omega)$ is called the regular conditional distribution of $Y$ given $X$. One can go one step further to define a regular conditional distribution for $Y$ given $X=x$ as a quantity $P(dy|X=x)$, and introduce an equivalent definition called stochastic kernel.

Definition VI.15. (Stochastic Kernel) [40, p. 28 and Theorem A.5.2, pp. 316-317]
Consider the measurable spaces $(\mathcal{X},\mathcal{B})$, $(\mathcal{Y},\mathcal{A})$.

A stochastic kernel is a mapping $q:\mathcal{A}\times\mathcal{X}\rightarrow[0,1]$ satisfying the following two properties:

1) For every $x\in\mathcal{X}$, the set function $q(\cdot;x)$ is a probability measure on $\mathcal{A}$;

2) for every $F\in\mathcal{A}$ the function $q(F;\cdot)$ is $\mathcal{B}$-measurable.

The set of stochastic kernels on $\mathcal{Y}$ given $\mathcal{X}$ is denoted by $\mathcal{Q}(\mathcal{Y};\mathcal{X})$.

The next important result relates a certain measurability called consistency of conditional 
distributions to the existence of a sequence of stochastic kernels. 

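Theorem VI.16. (Equivalence of Consistent Family of Measures and a Sequence of Conditional Distributions) [21, Chapter 1, Theorem 1.1, pp. 4-5]

Let $\{(\mathcal{X}_n,\mathcal{B}(\mathcal{X}_n)):n\in\mathbb{N}\}$ be complete separable metric spaces (Polish spaces) with $\mathcal{B}(\mathcal{X}_n)$ the $\sigma$-algebra of Borel sets. For any family of measures $\mathbf{P}(\cdot|\mathbf{y})$ on $(\mathcal{X}^{\mathbb{N}},\mathcal{B}(\mathcal{X}^{\mathbb{N}}))$ satisfying consistency condition

C1: If $E\in\mathcal{B}(\mathcal{X}_{0,n})$ then $\mathbf{P}(E|\mathbf{y})$ is a $\mathcal{B}(\mathcal{Y}_{0,n-1})$-measurable function of $\mathbf{y}\in\mathcal{Y}^{\mathbb{N}}$

there exists a family of conditional distributions $\{p_n(dx_n;x^{n-1},y^{n-1}):n\in\mathbb{N}\}$ satisfying conditions

i) For every $n\in\mathbb{N}$, $p_n(\cdot;x^{n-1},y^{n-1})$ is a probability measure on $\mathcal{B}(\mathcal{X}_n)$;

ii) For every $n\in\mathbb{N}$ and $A_n\in\mathcal{B}(\mathcal{X}_n)$, $p_n(A_n;x^{n-1},y^{n-1})$ is a $\odot_{i=0}^{n-1}(\mathcal{B}(\mathcal{X}_i)\odot\mathcal{B}(\mathcal{Y}_i))$-measurable function of $x^{n-1}\in\mathcal{X}_{0,n-1}$, $y^{n-1}\in\mathcal{Y}_{0,n-1}$.

Convexity of Regular Conditional Distributions.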
Convexity of Regular Conditional Distributions. 

Next, we give the definition of the well-known convexity properties of the family of conditional 
distributions. 

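Let $\mathcal{M}_1(\Omega,\mathcal{F},\mathcal{G},P)$, $\mathcal{G}\subset\mathcal{F}$, denote the set of all probability measures $Q$ on $(\Omega,\mathcal{F})$ such that $Q\ll P$ and $Q_{\mathcal{G}}\sim P_{\mathcal{G}}$. Here $P$ is a fixed probability measure. Let $\mathcal{M}_1(\Omega,\mathcal{F},P|\mathcal{G})(\omega)$ denote the set of all regular conditional probability measures on $(\Omega,\mathcal{F})$ conditional on $\mathcal{G}$ defined by $\mathcal{M}_1(\Omega,\mathcal{F},P|\mathcal{G})(\omega)=\{Q(\cdot|\mathcal{G}):Q\in\mathcal{M}_1(\Omega,\mathcal{F},\mathcal{G},P)\}$. Next, we define the almost sure convexity of the set of regular conditional distributions.

Definition VI.17. (Almost Sure Convexity of a Set of Regular Conditional Probability Measures)
We say that the set $\mathcal{M}_1(\Omega,\mathcal{F},P|\mathcal{G})(\omega)$ is convex if the following property holds.
Given $Q_1(\cdot|\mathcal{G})(\omega)$, $Q_2(\cdot|\mathcal{G})(\omega)$ in $\mathcal{M}_1(\Omega,\mathcal{F},P|\mathcal{G})(\omega)$ and $\lambda\in(0,1)$, there exists a probability measure $Q$ on $(\Omega,\mathcal{F})$ whose regular conditional probability measure $Q(\cdot|\mathcal{G})(\omega)\in\mathcal{M}_1(\Omega,\mathcal{F},P|\mathcal{G})(\omega)$ satisfies $Q(\cdot|\mathcal{G})(\omega)=\lambda Q_1(\cdot|\mathcal{G})(\omega)+(1-\lambda)Q_2(\cdot|\mathcal{G})(\omega)$, $P_{\mathcal{G}}$-a.s. That is, there exists a $P_{\mathcal{G}}$-null set $N$ such that for all $D\in\mathcal{F}$ and for all $\omega\in N^c$,

$$Q(D|\mathcal{G})(\omega)=\lambda Q_1(D|\mathcal{G})(\omega)+(1-\lambda)Q_2(D|\mathcal{G})(\omega).$$

This property is denoted by $\lambda Q_1(\cdot|\mathcal{G})(\omega)+(1-\lambda)Q_2(\cdot|\mathcal{G})(\omega)\in\mathcal{M}_1(\Omega,\mathcal{F},P|\mathcal{G})(\omega)$.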

Relative Entropy. 

Mutual information is often defined via the relative entropy or Kullback-Leibler distance between
two distributions. Next, we give the definition and the chain rule of relative entropy.

Definition VI.18. (Relative Entropy) [30, Definition 1.4.1, p. 21] 

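Given a measurable space $(\mathcal{Z}, \mathcal{B}(\mathcal{Z}))$, the relative entropy between two probability measures $P \in \mathcal{M}_1(\mathcal{Z})$ and $Q \in \mathcal{M}_1(\mathcal{Z})$ is defined by

$$\mathbb{D}(P\|Q) = \begin{cases} \displaystyle\int_{\mathcal{Z}} \log\Big(\frac{dP}{dQ}\Big)\, dP & \text{if } P \ll Q \\ +\infty & \text{otherwise} \end{cases}$$

where $\frac{dP}{dQ}$ denotes the RND (density) of $P$ with respect to $Q$, and $P \ll Q$ denotes absolute continuity of $P$ with respect to $Q$ (i.e., if $Q(A) = 0$ for some $A \in \mathcal{B}(\mathcal{Z})$ then $P(A) = 0$).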

The previous definition is often used to define mutual information between RV's $X$ and $Y$ via
$I(X;Y) = \mathbb{D}(P_{X,Y}\|P_X \times P_Y)$. However, it is often desirable to express $I(X;Y)$ as a functional
of the distribution $P_X$ and the conditional distribution $P_{Y|X}$. This is established via the chain rule of
relative entropy given below.
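
For finite alphabets the definitions above reduce to sums, which allows a quick numerical sanity check. The following is a minimal sketch under that assumption; the function names `relative_entropy` and `mutual_information` and the example joint distribution are illustrative choices, not part of the paper.

```python
import math

def relative_entropy(p, q):
    """D(p||q) in nats for finite pmfs given as lists; +inf unless p << q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue              # convention: 0 log 0 = 0
        if qi == 0.0:
            return math.inf       # p is not absolutely continuous w.r.t. q
        total += pi * math.log(pi / qi)
    return total

def mutual_information(joint):
    """I(X;Y) = D(P_XY || P_X x P_Y) for a joint pmf given as joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    flat_joint = [v for row in joint for v in row]
    flat_product = [a * b for a in px for b in py]
    return relative_entropy(flat_joint, flat_product)

# Hypothetical example: uniform input through a binary symmetric channel
# with crossover probability 0.1.
joint_bsc = [[0.45, 0.05], [0.05, 0.45]]
print(mutual_information(joint_bsc))
# Absolute continuity fails here, so the relative entropy is +inf.
print(relative_entropy([0.5, 0.5], [1.0, 0.0]))
```

The early return of `math.inf` mirrors the "otherwise" branch of Definition VI.18.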

Theorem VI.19. (Chain Rule of Relative Entropy) [40, Theorem B.2.1, p. 326]
Let $(\mathcal{X}, \mathcal{B})$ and $(\mathcal{Y}, \mathcal{A})$ be Polish spaces (complete separable metric spaces) and $P$ and $Q$ be probability measures on $\mathcal{X} \times \mathcal{Y}$. Denote by $P_1$ and $Q_1$ the first marginals of $P$ and $Q$ (i.e., $P_1(A) = P(A \times \mathcal{Y})$ and $Q_1(A) = Q(A \times \mathcal{Y})$, $\forall A \in \mathcal{B}$), respectively, and by $\alpha(dy; x)$ and $\beta(dy; x)$ the stochastic kernels on $\mathcal{Y}$ given $\mathcal{X}$ for which we have the decompositions $P(dx, dy) = P_1(dx) \otimes \alpha(dy; x)$ and $Q(dx, dy) = Q_1(dx) \otimes \beta(dy; x)$. Then the function mapping $x \mapsto H(\alpha(\cdot; x)\|\beta(\cdot; x))$ is measurable and

$$H(P\|Q) = H(P_1\|Q_1) + \int_{\mathcal{X}} H(\alpha(\cdot; x)\|\beta(\cdot; x))\, P_1(dx).$$

By invoking the chain rule of relative entropy with an appropriate choice of $P$ and $Q$ and Lemma VI.14, it can be shown that $I(X;Y) = \int_{\mathcal{X}} \mathbb{D}(P_{Y|X}(\cdot|x)\|P_Y(\cdot))\, P_X(dx)$.

VII. Appendix B

A. Additional Discussion

1) Directed Information as a Functional of Sequence of Conditional Distributions: The equivalent representations (III.10)-(III.12) are obtained from the definition of conditional mutual information $I(X^i; Y_i|Y^{i-1})$, by invoking the chain rule of relative entropy and the relation between absolute continuity of measures (Appendix VI-A, Theorem VI.19 and Lemma VI.14). These are also discussed formally in [ ]. Indeed, if $P_{0,i}(\cdot, \cdot; y^{i-1}) \ll P_{0,i}(\cdot; y^{i-1}) \times \nu_{i|i-1}(\cdot; y^{i-1})$, $\nu_{0,i-1}$-almost all $y^{i-1} \in \mathcal{Y}_{0,i-1}$, then by the Radon-Nikodym theorem there exists a version of the RND

$$\xi_i(x^i, y^i) = \frac{P_{0,i}(\cdot, \cdot; y^{i-1})}{P_{0,i}(\cdot; y^{i-1}) \times \nu_{i|i-1}(\cdot; y^{i-1})}(x^i, y_i),$$

which is non-negative for all $i = 0, 1, \ldots, n$. Moreover, by Lemma VI.14, $P_{0,i}(\cdot, \cdot; y^{i-1}) \ll P_{0,i}(\cdot; y^{i-1}) \times \nu_{i|i-1}(\cdot; y^{i-1})$, $\nu_{0,i-1}$-almost all $y^{i-1} \in \mathcal{Y}_{0,i-1}$, if and only if $q_i(\cdot; y^{i-1}, x^i) \ll \nu_{i|i-1}(\cdot; y^{i-1})$, $P_{0,i}$-almost all $(x^i, y^{i-1}) \in \mathcal{X}_{0,i} \times \mathcal{Y}_{0,i-1}$, $i = 0, 1, \ldots, n$, and in this case the RND $\xi_i(x^i, y^i)$ represents a version of $\frac{q_i(\cdot; y^{i-1}, x^i)}{\nu_{i|i-1}(\cdot; y^{i-1})}(y_i)$, $P_{0,i}$-almost all $(x^i, y^{i-1}) \in \mathcal{X}_{0,i} \times \mathcal{Y}_{0,i-1}$, $i = 0, 1, \ldots, n$ (formally obtained via the decomposition $P_{0,i}(dx^i, dy_i; y^{i-1}) = P_{0,i}(dx^i; y^{i-1}) \otimes q_i(dy_i; y^{i-1}, x^i)$).

Hence, when the RND $\xi_i(x^i, y^i)$ exists, repeated application of the chain rule, Theorem VI.19, or following the derivation of [40, Theorem B.2.1] (this is a very lengthy procedure), yields that (III.12) is obtained from (III.10). Similarly, starting with (III.12) one also obtains (III.10). Note that (III.10) and (III.12) are valid even when the RNDs do not exist, in which case (III.10) and (III.12) take the value $+\infty$.
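
The chain rule of relative entropy (Theorem VI.19) invoked above can be checked numerically when both spaces are finite. This is a minimal sketch under that assumption; `kl`, `chain_rule_sides` and the example distributions are hypothetical names and values chosen for illustration only.

```python
import math

def kl(p, q):
    """Relative entropy of finite pmfs (lists), with 0 log 0 = 0 and +inf if p is not << q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def chain_rule_sides(P, Q):
    """Return both sides of H(P||Q) = H(P1||Q1) + sum_x P1(x) H(alpha(.;x)||beta(.;x)).

    P[x][y] and Q[x][y] are joint pmfs on a finite product space.
    """
    P1 = [sum(row) for row in P]
    Q1 = [sum(row) for row in Q]
    lhs = kl([v for row in P for v in row], [v for row in Q for v in row])
    rhs = kl(P1, Q1)
    for x in range(len(P)):
        if P1[x] == 0.0:
            continue                      # x carries no P1-mass
        if Q1[x] == 0.0:
            return lhs, math.inf          # P1 is not << Q1
        alpha = [v / P1[x] for v in P[x]]  # kernel of P given x
        beta = [v / Q1[x] for v in Q[x]]   # kernel of Q given x
        rhs += P1[x] * kl(alpha, beta)
    return lhs, rhs

P = [[0.3, 0.2], [0.1, 0.4]]
Q = [[0.25, 0.25], [0.25, 0.25]]
print(chain_rule_sides(P, Q))

# Choosing Q = P_X x P_Y makes the first term vanish and leaves
# I(X;Y) = sum_x P_X(x) D(P_{Y|X}(.|x) || P_Y).
px = [sum(row) for row in P]
py = [sum(col) for col in zip(*P)]
product = [[a * b for b in py] for a in px]
print(chain_rule_sides(P, product))
```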

2) Directed Information as the supremum over finite partitions: In this section we illustrate how directed information can be equivalently defined as a supremum over appropriate partitions of measurable spaces with respect to $\overleftarrow{P}(\cdot|\cdot) \in \mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}}; \mathcal{Y}^{\mathbb{N}})$ and $\overrightarrow{Q}(\cdot|\cdot) \in \mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}}; \mathcal{X}^{\mathbb{N}})$. Recall that relative entropy or information divergence of two probability measures $P$ and $Q$ (on a $\sigma$-algebra of subsets of a Polish space $\mathcal{Z}$) is also defined [40] as

$$\mathbb{D}(P\|Q) = \sup_{\pi \in \Pi(\mathcal{Z})} \sum_{A \in \pi} P(A) \log\frac{P(A)}{Q(A)} \tag{VII.1}$$

where $\Pi(\mathcal{Z})$ denotes the class of all finite measurable partitions of $\mathcal{Z}$. In (VII.1), the summand takes the value $0$ if $P(A) = 0$ and the value $+\infty$ if $P(A) > 0$ and $Q(A) = 0$. Thus, $\pi = \{A_1, \ldots, A_{m(\pi)}\} \in \Pi(\mathcal{Z})$ is a partition of $\mathcal{Z}$ if $A_j$, $j = 1, \ldots, m(\pi)$, are measurable and $A_i \cap A_j = \emptyset$, $\forall i \neq j$, $\bigcup_{j=1}^{m(\pi)} A_j = \mathcal{Z}$. According to Dobrushin's theorem [29], under certain conditions mutual information $I(X;Y) = \mathbb{D}(P_{X,Y}\|P_X \times P_Y)$ of two arbitrary RV's $X$ and $Y$ can be represented as a supremum of mutual information of discrete RV's corresponding to quantization of $Y$ and $X$. Next, we utilize Dobrushin's theorem to represent directed information. Recall the following condition from Pinsker [ ]. Given any set $\mathcal{Z}$ and a $\sigma$-algebra of its subsets, let $\Lambda(\mathcal{Z})$ be a class of partitions $\pi = \{A_1, \ldots, A_{m(\pi)}\}$ of $\mathcal{Z}$ such that 1) any $\pi_1 \in \Lambda(\mathcal{Z})$ and $\pi_2 \in \Lambda(\mathcal{Z})$ have a common refinement $\pi \in \Lambda(\mathcal{Z})$, and 2) the algebra of all finite unions of atoms of partitions in $\Lambda(\mathcal{Z})$ generates the given $\sigma$-algebra. Then (VII.1) is valid with the supremum taken over $\pi \in \Lambda(\mathcal{Z}) \subset \Pi(\mathcal{Z})$.
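
On a finite space the supremum in (VII.1) is attained by the partition into singletons, and coarser partitions give lower bounds. The following minimal sketch illustrates this monotonicity under refinement; the function name `partition_divergence` and the two distributions are assumptions made for the example.

```python
import math

def partition_divergence(p, q, partition):
    """Sum over atoms A of P(A) log(P(A)/Q(A)), with the conventions stated for (VII.1).

    p, q are pmfs on {0, ..., len(p) - 1}; partition is a list of lists of points.
    """
    total = 0.0
    for atom in partition:
        pa = sum(p[z] for z in atom)
        qa = sum(q[z] for z in atom)
        if pa == 0.0:
            continue              # summand is 0 when P(A) = 0
        if qa == 0.0:
            return math.inf       # summand is +inf when P(A) > 0 and Q(A) = 0
        total += pa * math.log(pa / qa)
    return total

p = [0.1, 0.2, 0.3, 0.4]
q = [0.4, 0.3, 0.2, 0.1]
trivial = [[0, 1, 2, 3]]           # one atom: no information
coarse = [[0, 1], [2, 3]]          # a refinement of the trivial partition
finest = [[0], [1], [2], [3]]      # singletons: attains the supremum here
values = [partition_divergence(p, q, pi) for pi in (trivial, coarse, finest)]
print(values)
```

Each refinement step can only increase the sum, which is a consequence of the log-sum inequality.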

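For product spaces $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, 1) and 2) hold for the class of product partitions $\pi = \mathcal{A} \times \mathcal{B}$ with atoms $A_i \times B_j$, $i = 1, \ldots, m(\mathcal{A})$, $j = 1, \ldots, m(\mathcal{B})$, if $\mathcal{A}$ and $\mathcal{B}$ range over classes of partitions of $\mathcal{X}$ and $\mathcal{Y}$, respectively, having properties 1) and 2).

Let $\mathcal{Z}_{0,n} = \mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}$ and consider the class of product partitions $\pi_{0,n} = \times_{k=0}^{n}(\mathcal{A}_k \times \mathcal{B}_k)$, with $\mathcal{A}_k \times \mathcal{B}_k$ having atoms $A_{k,i} \times B_{k,j}$, $i = 1, \ldots, m(\mathcal{A}_k)$, $j = 1, \ldots, m(\mathcal{B}_k)$, where $\mathcal{A}_k$ and $\mathcal{B}_k$ range over the classes of partitions of $\mathcal{X}_k$ and $\mathcal{Y}_k$, respectively, having properties 1) and 2), $k = 0, 1, \ldots, n$, denoted by $\Lambda(\mathcal{Z}_{0,n})$.

For any $\overleftarrow{P}(\cdot|y) \in \mathcal{Q}^{C1}(\mathcal{X}^{\mathbb{N}}; \mathcal{Y}^{\mathbb{N}})$ and $\overrightarrow{Q}(\cdot|x) \in \mathcal{Q}^{C2}(\mathcal{Y}^{\mathbb{N}}; \mathcal{X}^{\mathbb{N}})$, directed information is defined by

$$\mathbb{D}(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n}\|\overrightarrow{\Pi}_{0,n}) = \sup_{\pi_{0,n} \in \Lambda(\mathcal{Z}_{0,n})} \sum_{A \in \pi_{0,n}} (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})(A) \log\frac{(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})(A)}{(\overleftarrow{P}_{0,n} \otimes \nu_{0,n})(A)}. \tag{VII.2}$$

Similarly, an equivalent definition of directed information as a supremum over partitions is obtained via (III.18).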



B. Proof of Theorem III.4

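1) Fix $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ and let $\overrightarrow{Q}^1_{0,n}(\cdot|x^n), \overrightarrow{Q}^2_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$.

Then, the joint distributions corresponding to $\overrightarrow{Q}^1_{0,n}(\cdot|x^n)$, $\overrightarrow{Q}^2_{0,n}(\cdot|x^n)$ are

$$(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^1_{0,n})(dx^n, dy^n) \quad \text{and} \quad (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^2_{0,n})(dx^n, dy^n),$$

and the marginals are

$$\nu^i_{0,n}(dy^n) = \int_{\mathcal{X}_{0,n}} (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^i_{0,n})(dx^n, dy^n), \quad i = 1, 2.$$

Since the set $\mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ is convex, given $\lambda \in (0,1)$ there exists a probability measure $P$ on $(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}))$ whose regular conditional measure $\overrightarrow{Q}(\cdot|x) \in \mathcal{M}_1(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), P|\mathcal{B}(\mathcal{X}_{0,n}))(x^n)$ satisfies

$$\overrightarrow{Q}_{0,n}(\cdot|x^n) = \lambda \overrightarrow{Q}^1_{0,n}(\cdot|x^n) + (1-\lambda) \overrightarrow{Q}^2_{0,n}(\cdot|x^n), \quad P|_{\mathcal{B}(\mathcal{X}_{0,n})}\text{-a.s.},$$

and C1 holds. Define

$$\nu_{0,n}(dy^n) = \lambda \nu^1_{0,n}(dy^n) + (1-\lambda) \nu^2_{0,n}(dy^n).$$

Introduce the RNDs $\Lambda^i_{0,n}(x^n, y^n) = \frac{\overrightarrow{Q}^i_{0,n}(dy^n|x^n)}{\nu^i_{0,n}(dy^n)}$, $\Psi^i_{0,n}(x^n, y^n) = \frac{\overrightarrow{Q}^i_{0,n}(dy^n|x^n)}{\nu_{0,n}(dy^n)}$, $K^i_{0,n}(y^n) = \frac{\nu^i_{0,n}(dy^n)}{\nu_{0,n}(dy^n)}$ and $\Lambda_{0,n}(x^n, y^n) = \frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{\nu_{0,n}(dy^n)}$, $i = 1, 2$. Then,

$$\lambda \Psi^1_{0,n}(x^n, y^n) + (1-\lambda) \Psi^2_{0,n}(x^n, y^n) = \lambda \frac{\overrightarrow{Q}^1_{0,n}(dy^n|x^n)}{\nu_{0,n}(dy^n)} + (1-\lambda) \frac{\overrightarrow{Q}^2_{0,n}(dy^n|x^n)}{\nu_{0,n}(dy^n)} = \frac{\lambda \overrightarrow{Q}^1_{0,n}(dy^n|x^n) + (1-\lambda) \overrightarrow{Q}^2_{0,n}(dy^n|x^n)}{\lambda \nu^1_{0,n}(dy^n) + (1-\lambda) \nu^2_{0,n}(dy^n)} = \Lambda_{0,n}(x^n, y^n)$$

and

$$\lambda K^1_{0,n}(y^n) + (1-\lambda) K^2_{0,n}(y^n) = \lambda \frac{\nu^1_{0,n}(dy^n)}{\nu_{0,n}(dy^n)} + (1-\lambda) \frac{\nu^2_{0,n}(dy^n)}{\nu_{0,n}(dy^n)} = \frac{\lambda \nu^1_{0,n}(dy^n) + (1-\lambda) \nu^2_{0,n}(dy^n)}{\lambda \nu^1_{0,n}(dy^n) + (1-\lambda) \nu^2_{0,n}(dy^n)} = 1.$$

Applying the log-sum formula [44, Theorem 2.7.1, p. 31] yields

$$\begin{aligned}
&\lambda \Psi^1_{0,n}(x^n, y^n) \log \Lambda^1_{0,n}(x^n, y^n) + (1-\lambda) \Psi^2_{0,n}(x^n, y^n) \log \Lambda^2_{0,n}(x^n, y^n) \\
&\quad = \lambda \Psi^1_{0,n}(x^n, y^n) \log\frac{\Psi^1_{0,n}(x^n, y^n)}{K^1_{0,n}(y^n)} + (1-\lambda) \Psi^2_{0,n}(x^n, y^n) \log\frac{\Psi^2_{0,n}(x^n, y^n)}{K^2_{0,n}(y^n)} \\
&\quad \geq \Big(\lambda \Psi^1_{0,n}(x^n, y^n) + (1-\lambda) \Psi^2_{0,n}(x^n, y^n)\Big) \log\frac{\lambda \Psi^1_{0,n}(x^n, y^n) + (1-\lambda) \Psi^2_{0,n}(x^n, y^n)}{\lambda K^1_{0,n}(y^n) + (1-\lambda) K^2_{0,n}(y^n)} \\
&\quad = \Lambda_{0,n}(x^n, y^n) \log \Lambda_{0,n}(x^n, y^n).
\end{aligned}$$

Integrating the above with respect to $\nu_{0,n}(dy^n) \otimes \overleftarrow{P}_{0,n}(dx^n|y^{n-1})$ yields:

$$\begin{aligned}
\int_{\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}} \log\Big(\frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{\nu_{0,n}(dy^n)}\Big) (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n)
&\leq \lambda \int_{\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}} \log\Big(\frac{\overrightarrow{Q}^1_{0,n}(dy^n|x^n)}{\nu^1_{0,n}(dy^n)}\Big) (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^1_{0,n})(dx^n, dy^n) \\
&\quad + (1-\lambda) \int_{\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}} \log\Big(\frac{\overrightarrow{Q}^2_{0,n}(dy^n|x^n)}{\nu^2_{0,n}(dy^n)}\Big) (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^2_{0,n})(dx^n, dy^n).
\end{aligned}$$

Hence,

$$I_{X^n \to Y^n}(\overleftarrow{P}_{0,n}, \lambda \overrightarrow{Q}^1_{0,n} + (1-\lambda) \overrightarrow{Q}^2_{0,n}) \leq \lambda I_{X^n \to Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}^1_{0,n}) + (1-\lambda) I_{X^n \to Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}^2_{0,n}).$$

This completes the derivation of 1).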
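
The pointwise step in the derivation of 1) is an instance of the log-sum inequality, $\sum_i a_i \log(a_i/b_i) \geq (\sum_i a_i) \log(\sum_i a_i / \sum_i b_i)$, applied to two weighted density ratios. The following minimal sketch evaluates the gap between the two sides for strictly positive inputs; the function name `log_sum_gap` and the numerical values are assumptions chosen for illustration.

```python
import math

def log_sum_gap(a, b):
    """Return sum_i a_i log(a_i / b_i) - (sum a) log(sum a / sum b).

    The log-sum inequality states that this gap is nonnegative.
    Assumes all entries of a and b are strictly positive.
    """
    lhs = sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))
    sa, sb = sum(a), sum(b)
    return lhs - sa * math.log(sa / sb)

# Hypothetical values of two density ratios at one point, weighted by
# lam and 1 - lam as in a convex combination of two channels.
lam = 0.3
psi = (2.0, 0.5)   # stand-ins for the numerator densities
k = (1.5, 0.8)     # stand-ins for the denominator densities
a = [lam * psi[0], (1 - lam) * psi[1]]
b = [lam * k[0], (1 - lam) * k[1]]
print(log_sum_gap(a, b))

# Equality holds when a_i / b_i is the same for every i.
print(log_sum_gap([1.0, 2.0], [2.0, 4.0]))
```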

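2) Fix $\overrightarrow{Q}_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ and let $\overleftarrow{P}^1_{0,n}(\cdot|y^{n-1}), \overleftarrow{P}^2_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$.

Then, the joint distributions corresponding to $\overleftarrow{P}^1_{0,n}(\cdot|y^{n-1})$, $\overleftarrow{P}^2_{0,n}(\cdot|y^{n-1})$ are

$$(\overleftarrow{P}^1_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n) \quad \text{and} \quad (\overleftarrow{P}^2_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n).$$

The marginals corresponding to $\overleftarrow{P}^1_{0,n}(\cdot|y^{n-1})$, $\overleftarrow{P}^2_{0,n}(\cdot|y^{n-1})$ are

$$\nu^i_{0,n}(dy^n) = \int_{\mathcal{X}_{0,n}} (\overleftarrow{P}^i_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n), \quad i = 1, 2.$$

Since the set $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ is convex, given $\lambda \in (0,1)$ there exists a probability measure $P$ on $(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}))$ whose regular conditional measure $\overleftarrow{P}(\cdot|y) \in \mathcal{M}_1(\mathcal{X}^{\mathbb{N}} \times \mathcal{Y}^{\mathbb{N}}, \mathcal{B}(\mathcal{X}^{\mathbb{N}}) \odot \mathcal{B}(\mathcal{Y}^{\mathbb{N}}), \mathcal{B}(\mathcal{X}^{\mathbb{N}}), P|\mathcal{B}(\mathcal{Y}_{0,n-1}))(y^{n-1})$ satisfies

$$\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) = \lambda \overleftarrow{P}^1_{0,n}(\cdot|y^{n-1}) + (1-\lambda) \overleftarrow{P}^2_{0,n}(\cdot|y^{n-1}), \quad P|_{\mathcal{B}(\mathcal{Y}_{0,n-1})}\text{-a.s.},$$

and C2 holds. Then, corresponding to $\overleftarrow{P}_{0,n}(dx^n|y^{n-1})$ and $\overrightarrow{Q}_{0,n}(dy^n|x^n)$ we have

$$\begin{aligned}
\nu_{0,n}(dy^n) &= \int_{\mathcal{X}_{0,n}} \Big(\lambda \overleftarrow{P}^1_{0,n}(dx^n|y^{n-1}) + (1-\lambda) \overleftarrow{P}^2_{0,n}(dx^n|y^{n-1})\Big) \otimes \overrightarrow{Q}_{0,n}(dy^n|x^n) \\
&= \lambda \nu^1_{0,n}(dy^n) + (1-\lambda) \nu^2_{0,n}(dy^n).
\end{aligned}$$

Pick any measure $U_{0,n}(dy^n) \in \mathcal{M}_1(\mathcal{Y}_{0,n})$ with $\mathbb{D}(\nu_{0,n}\|U_{0,n}) < \infty$, e.g., such that $\nu_{0,n}(\cdot) \ll U_{0,n}(\cdot)$. Since $\overrightarrow{Q}_{0,n}(\cdot|x^n) \ll \nu_{0,n}(\cdot)$, for $\mu_{0,n}$-almost all $x^n \in \mathcal{X}_{0,n}$, and $\nu_{0,n}(\cdot) \ll U_{0,n}(\cdot)$, then $\overrightarrow{Q}_{0,n}(\cdot|x^n) \ll U_{0,n}(\cdot)$, for $\mu_{0,n}$-almost all $x^n \in \mathcal{X}_{0,n}$. Consider

$$\begin{aligned}
&\int_{\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}} \log\Big(\frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{\nu_{0,n}(dy^n)}\Big) (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n) \\
&\quad = \int_{\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}} \log\Big(\frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{U_{0,n}(dy^n)} \times \frac{U_{0,n}(dy^n)}{\nu_{0,n}(dy^n)}\Big) (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n) \\
&\quad = \int_{\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}} \log\Big(\frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{U_{0,n}(dy^n)}\Big) (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n) - \int_{\mathcal{Y}_{0,n}} \log\Big(\frac{\nu_{0,n}(dy^n)}{U_{0,n}(dy^n)}\Big) \nu_{0,n}(dy^n).
\end{aligned}$$

Hence,

$$\begin{aligned}
I_{X^n \to Y^n}(\lambda \overleftarrow{P}^1_{0,n} + (1-\lambda) \overleftarrow{P}^2_{0,n}, \overrightarrow{Q}_{0,n})
&= \int_{\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}} \log\Big(\frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{U_{0,n}(dy^n)}\Big) \overrightarrow{Q}_{0,n}(dy^n|x^n) \otimes \Big(\lambda \overleftarrow{P}^1_{0,n}(dx^n|y^{n-1}) + (1-\lambda) \overleftarrow{P}^2_{0,n}(dx^n|y^{n-1})\Big) \\
&\quad - \int_{\mathcal{Y}_{0,n}} \log\Big(\frac{\nu_{0,n}(dy^n)}{U_{0,n}(dy^n)}\Big) \nu_{0,n}(dy^n).
\end{aligned}$$

Moreover, relative entropy is convex in both arguments (e.g., $\mathbb{D}(\cdot\|U_{0,n})$ is convex for fixed $U_{0,n}$), hence

$$\begin{aligned}
I_{X^n \to Y^n}(\lambda \overleftarrow{P}^1_{0,n} + (1-\lambda) \overleftarrow{P}^2_{0,n}, \overrightarrow{Q}_{0,n})
&\geq \lambda \int_{\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}} \log\Big(\frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{U_{0,n}(dy^n)}\Big) (\overleftarrow{P}^1_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n) \\
&\quad - \lambda \int_{\mathcal{Y}_{0,n}} \log\Big(\frac{\nu^1_{0,n}(dy^n)}{U_{0,n}(dy^n)}\Big) \nu^1_{0,n}(dy^n) \\
&\quad + (1-\lambda) \int_{\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}} \log\Big(\frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{U_{0,n}(dy^n)}\Big) (\overleftarrow{P}^2_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n) \\
&\quad - (1-\lambda) \int_{\mathcal{Y}_{0,n}} \log\Big(\frac{\nu^2_{0,n}(dy^n)}{U_{0,n}(dy^n)}\Big) \nu^2_{0,n}(dy^n).
\end{aligned}$$

Finally, since $\nu^i_{0,n}(\cdot) \ll U_{0,n}(\cdot)$ and $\overrightarrow{Q}_{0,n}(\cdot|x^n) \ll \nu^i_{0,n}(\cdot)$, substituting the versions $\frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{U_{0,n}(dy^n)} = \frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{\nu^i_{0,n}(dy^n)} \times \frac{\nu^i_{0,n}(dy^n)}{U_{0,n}(dy^n)}$, $i = 1, 2$, of the RND for $\frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{U_{0,n}(dy^n)}$ in the first and third right hand side expressions of the preceding inequality yields

$$I_{X^n \to Y^n}(\lambda \overleftarrow{P}^1_{0,n} + (1-\lambda) \overleftarrow{P}^2_{0,n}, \overrightarrow{Q}_{0,n}) \geq \lambda I_{X^n \to Y^n}(\overleftarrow{P}^1_{0,n}, \overrightarrow{Q}_{0,n}) + (1-\lambda) I_{X^n \to Y^n}(\overleftarrow{P}^2_{0,n}, \overrightarrow{Q}_{0,n}).$$

This completes the derivation of 2).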

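3) Here, it will be shown that for $\overrightarrow{Q}^1_{0,n}(\cdot|x^n), \overrightarrow{Q}^2_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ such that $\overrightarrow{Q}^1_{0,n}(\cdot|x^n) \neq \overrightarrow{Q}^2_{0,n}(\cdot|x^n)$, and $\lambda \in (0,1)$, we have $I_{X^n \to Y^n}(\overleftarrow{P}_{0,n}, \lambda \overrightarrow{Q}^1_{0,n} + (1-\lambda) \overrightarrow{Q}^2_{0,n}) < \lambda I_{X^n \to Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}^1_{0,n}) + (1-\lambda) I_{X^n \to Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}^2_{0,n})$, for a fixed $\overleftarrow{P}_{0,n} \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$.
It is already known that $I_{X^n \to Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n})$ is a convex functional on $\mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$ for a fixed $\overleftarrow{P}_{0,n} \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$. All that is required to show in order to have strict convexity is that $I_{X^n \to Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n}) < \infty$. This can be easily obtained from part 1) since $\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n} \ll \overleftarrow{P}_{0,n} \otimes \nu_{0,n}$ if and only if $\overrightarrow{Q}_{0,n}(\cdot|x^n) \ll \nu_{0,n}(\cdot)$, for $\mu_{0,n}$-almost all $x^n \in \mathcal{X}_{0,n}$. Hence, from the strict convexity of the function $s \log s$, $s \in [0, \infty)$, and the expression of directed information as a functional of $\{\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n}\} \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1}) \times \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$, with $\overrightarrow{Q}_{0,n} = \lambda \overrightarrow{Q}^1_{0,n} + (1-\lambda) \overrightarrow{Q}^2_{0,n}$ it follows that

$$\begin{aligned}
I_{X^n \to Y^n}(\overleftarrow{P}_{0,n}, \lambda \overrightarrow{Q}^1_{0,n} + (1-\lambda) \overrightarrow{Q}^2_{0,n})
&< \lambda \int \log\Big(\frac{\overrightarrow{Q}^1_{0,n}(dy^n|x^n)}{\nu^1_{0,n}(dy^n)}\Big) (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^1_{0,n})(dx^n, dy^n) \\
&\quad + (1-\lambda) \int \log\Big(\frac{\overrightarrow{Q}^2_{0,n}(dy^n|x^n)}{\nu^2_{0,n}(dy^n)}\Big) (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^2_{0,n})(dx^n, dy^n) \\
&= \lambda I_{X^n \to Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}^1_{0,n}) + (1-\lambda) I_{X^n \to Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}^2_{0,n}) \\
&< \infty.
\end{aligned}$$

This completes the derivation of 3).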
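
The convexity and concavity statements of Theorem III.4 can be illustrated numerically on a small finite model. The following minimal sketch assumes binary alphabets and two time steps ($n = 1$), with randomly generated causally conditioned kernels; all function names (`random_backward`, `random_forward`, `mix`, `directed_information`) are hypothetical, and the check is an illustration for one random draw, not a proof.

```python
import itertools
import math
import random

A = (0, 1)  # binary alphabets for X_0, X_1, Y_0, Y_1 (an assumption of this sketch)

def random_pmf(rng, size):
    """A strictly positive random pmf of the given size."""
    w = [rng.random() + 0.05 for _ in range(size)]
    s = sum(w)
    return [v / s for v in w]

def random_backward(rng):
    """P(x0, x1 || y0) = p0(x0) p1(x1 | x0, y0), stored under keys (x0, x1, y0)."""
    p0 = random_pmf(rng, 2)
    p1 = {(x0, y0): random_pmf(rng, 2) for x0 in A for y0 in A}
    return {(x0, x1, y0): p0[x0] * p1[(x0, y0)][x1]
            for x0 in A for x1 in A for y0 in A}

def random_forward(rng):
    """Q(y0, y1 || x0, x1) = q0(y0 | x0) q1(y1 | y0, x0, x1), keys (y0, y1, x0, x1)."""
    q0 = {x0: random_pmf(rng, 2) for x0 in A}
    q1 = {(y0, x0, x1): random_pmf(rng, 2) for y0 in A for x0 in A for x1 in A}
    return {(y0, y1, x0, x1): q0[x0][y0] * q1[(y0, x0, x1)][y1]
            for y0 in A for y1 in A for x0 in A for x1 in A}

def mix(k1, k2, lam):
    """Convex combination of two causally conditioned kernels."""
    return {key: lam * k1[key] + (1.0 - lam) * k2[key] for key in k1}

def directed_information(P, Q):
    """Sum of (P (x) Q) log(Q / nu), where nu is the Y-marginal of the joint P (x) Q."""
    joint = {(x0, x1, y0, y1): P[(x0, x1, y0)] * Q[(y0, y1, x0, x1)]
             for x0, x1, y0, y1 in itertools.product(A, repeat=4)}
    nu = {(y0, y1): sum(joint[(x0, x1, y0, y1)] for x0 in A for x1 in A)
          for y0 in A for y1 in A}
    total = 0.0
    for (x0, x1, y0, y1), mass in joint.items():
        if mass > 0.0:
            total += mass * math.log(Q[(y0, y1, x0, x1)] / nu[(y0, y1)])
    return total

rng = random.Random(0)
P1, P2 = random_backward(rng), random_backward(rng)
Q1, Q2 = random_forward(rng), random_forward(rng)
lam = 0.3
# Convexity in the forward channel for a fixed backward channel.
convex_gap = (lam * directed_information(P1, Q1)
              + (1 - lam) * directed_information(P1, Q2)
              - directed_information(P1, mix(Q1, Q2, lam)))
# Concavity in the backward channel for a fixed forward channel.
concave_gap = (directed_information(mix(P1, P2, lam), Q1)
               - lam * directed_information(P1, Q1)
               - (1 - lam) * directed_information(P2, Q1))
print(convex_gap, concave_gap)
```

Both gaps are expected to be nonnegative up to floating-point rounding. Mixing the product-form kernels directly, rather than their per-step factors, matches the convex combinations used in the proof.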

C. Proof of Theorem III.5

A1) First, it is shown that the joint distribution of the basic joint process $\{(X_i^{(\alpha)}, Y_i^{(\alpha)}) : i \in \mathbb{N}\}$ converges as $\alpha \to \infty$ to the joint distribution of a joint process $\{(X_i^{(0)}, Y_i^{(0)}) : i \in \mathbb{N}\}$ and secondly, that this limiting joint process $\{(X_i^{(0)}, Y_i^{(0)}) : i \in \mathbb{N}\}$ is also a basic joint process corresponding to the backward channel $\overleftarrow{P}_{0,n}(\cdot|\cdot) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$, that is, $(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^{\alpha}_{0,n})(dx^n, dy^n) \xrightarrow{w} (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^{0}_{0,n})(dx^n, dy^n) \in \mathcal{M}_1(\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n})$ and that $(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^{0}_{0,n})(dx^n, dy^n)$ has backward channel $\overleftarrow{P}_{0,n}(\cdot|\cdot) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$.

The first part is established by showing weak compactness of the family of measures induced by $\{(X_i^{(\alpha)}, Y_i^{(\alpha)}) : i \in \mathbb{N}\}$. According to Prohorov's theorem (Appendix VI-A, Theorem VI.4), the family of joint probability distributions of the sequence $\{(X_i^{(\alpha)}, Y_i^{(\alpha)}) : i \in \mathbb{N}\}$ is weakly compact if it is tight (uniformly tight). Therefore, we shall show tightness of the family of measures induced by the joint sequence of RV's $\{(X_i^{(\alpha)}, Y_i^{(\alpha)}) : i \in \mathbb{N}\}$.

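Utilizing condition CA for a fixed $n \in \mathbb{N}$, for any compact sets $K_0 \subset \mathcal{X}_0, K_1 \subset \mathcal{X}_1, \ldots, K_{n-1} \subset \mathcal{X}_{n-1}$ the family of distributions $\{p_n(\cdot; x^{n-1}, y^{n-1}) : x_0 \in K_0, x_1 \in K_1, \ldots, x_{n-1} \in K_{n-1}, y^{n-1} \in \mathcal{Y}_{0,n-1}\} \subset \mathcal{Q}(\mathcal{X}_n; \mathcal{X}_{0,n-1} \times \mathcal{Y}_{0,n-1})$ is compact. Indeed, given any joint sequence $\{X_0^{(\alpha)}, \ldots, X_{n-1}^{(\alpha)}, Y_0^{(\alpha)}, \ldots, Y_{n-1}^{(\alpha)}\}$, by selecting a subsequence $\alpha_j$ such that the subsequence $\{X_0^{(\alpha_j)}, \ldots, X_{n-1}^{(\alpha_j)}, Y_0^{(\alpha_j)}, \ldots, Y_{n-1}^{(\alpha_j)}\}$ converges to $\{X_0^{(0)}, \ldots, X_{n-1}^{(0)}, Y_0^{(0)}, \ldots, Y_{n-1}^{(0)}\}$, a weakly convergent subsequence of measures $p_n(\cdot; x_0^{(\alpha_j)}, \ldots, x_{n-1}^{(\alpha_j)}, y_0^{(\alpha_j)}, \ldots, y_{n-1}^{(\alpha_j)})$ is obtained. By Prohorov's theorem, a family of measures on a complete separable metric space is weakly compact if and only if it is tight. Hence, for any sequence of compact sets $K_0 \subset \mathcal{X}_0, K_1 \subset \mathcal{X}_1, \ldots, K_{n-1} \subset \mathcal{X}_{n-1}$, and $\epsilon_1 > 0$, a compact set $K_n \subset \mathcal{X}_n$ can be constructed such that $p_n(K_n; x^{n-1}, y^{n-1}) \geq 1 - \epsilon_1$, for any $x_0 \in K_0, x_1 \in K_1, \ldots, x_{n-1} \in K_{n-1}$, $y^{n-1} \in \mathcal{Y}_{0,n-1}$. To this end, pick $\epsilon_1 > 0$ and construct the compact sets as follows. Choose a compact set $K_0 \subset \mathcal{X}_0$ such that $p_0(K_0) \geq 1 - \frac{\epsilon_1}{2}$, a compact set $K_1 \subset \mathcal{X}_1$ such that $p_1(K_1; x_0, y_0) \geq 1 - \frac{\epsilon_1}{2^2}$, for any $x_0 \in K_0, y_0 \in \mathcal{Y}_0$, a compact set $K_2 \subset \mathcal{X}_2$ such that $p_2(K_2; x_0, x_1, y_0, y_1) \geq 1 - \frac{\epsilon_1}{2^3}$, for any $x_0 \in K_0, x_1 \in K_1, y_0 \in \mathcal{Y}_0, y_1 \in \mathcal{Y}_1$, and a compact set $K_n$ such that $p_n(K_n; x_0, \ldots, x_{n-1}, y_0, \ldots, y_{n-1}) \geq 1 - \frac{\epsilon_1}{2^{n+1}}$, for any $x_0 \in K_0, \ldots, x_{n-1} \in K_{n-1}$, $y^{n-1} \in \mathcal{Y}_{0,n-1}$.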

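Likewise, since $\{\mathcal{Y}_i : i \in \mathbb{N}\}$ are compact Polish spaces, and measures on compact Polish spaces are weakly compact (Appendix VI-A, Theorem VI.1), the family of conditional distributions $\{q_n(\cdot; y^{n-1}, x^n) : y^{n-1} \in \mathcal{Y}_{0,n-1}, x^n \in \mathcal{X}_{0,n}\} \subset \mathcal{Q}(\mathcal{Y}_n; \mathcal{Y}_{0,n-1} \times \mathcal{X}_{0,n})$ is weakly compact. Utilizing the condition of weak compactness of a family of measures we verify that for any sequence of compact sets $\Phi_0 \subset \mathcal{Y}_0, \Phi_1 \subset \mathcal{Y}_1, \ldots, \Phi_{n-1} \subset \mathcal{Y}_{n-1}$, and $\epsilon_2 > 0$ there exists a compact set $\Phi_n \subset \mathcal{Y}_n$ such that $q_n(\Phi_n; y^{n-1}, x^n) \geq 1 - \epsilon_2$, for any $y_0 \in \Phi_0, y_1 \in \Phi_1, \ldots, y_{n-1} \in \Phi_{n-1}$, $x^n \in \mathcal{X}_{0,n}$. Choose an $\epsilon_2 > 0$ and construct compacts $\Phi_i \subset \mathcal{Y}_i$ as before: $\Phi_0 \subset \mathcal{Y}_0$ is such that $q_0(\Phi_0; x_0) \geq 1 - \frac{\epsilon_2}{2}$, for any $x_0 \in \mathcal{X}_0$, and after $\Phi_0 \subset \mathcal{Y}_0, \Phi_1 \subset \mathcal{Y}_1, \ldots, \Phi_{n-1} \subset \mathcal{Y}_{n-1}$ are selected we choose $\Phi_n \subset \mathcal{Y}_n$ so that $q_n(\Phi_n; y^{n-1}, x^n) \geq 1 - \frac{\epsilon_2}{2^{n+1}}$, for any $y_0 \in \Phi_0, \ldots, y_{n-1} \in \Phi_{n-1}$, $x^n \in \mathcal{X}_{0,n}$.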

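Based on the above construction, which states that for any $n \in \mathbb{N}$ the families $p_n(\cdot; \cdot, \cdot) \in \mathcal{Q}(\mathcal{X}_n; \mathcal{X}_{0,n-1} \times \mathcal{Y}_{0,n-1})$ and $q_n(\cdot; \cdot, \cdot) \in \mathcal{Q}(\mathcal{Y}_n; \mathcal{Y}_{0,n-1} \times \mathcal{X}_{0,n})$ are tight, specifically,

$$p_n(K_n; x^{n-1}, y^{n-1}) \geq 1 - \frac{\epsilon_1}{2^{n+1}}, \qquad q_n(\Phi_n; y^{n-1}, x^n) \geq 1 - \frac{\epsilon_2}{2^{n+1}} \tag{VII.3}$$

for any $x_0 \in K_0, \ldots, x_n \in K_n$ and $y_0 \in \Phi_0, \ldots, y_{n-1} \in \Phi_{n-1}$, we show uniform tightness of the family of joint probability distributions of the joint sequence $\{(X_i^{(\alpha)}, Y_i^{(\alpha)}) : i \in \mathbb{N}\}$. Under the above choice of $\{(K_i, \Phi_i) \subset \mathcal{X}_i \times \mathcal{Y}_i : i = 0, 1, \ldots, n\}$, we have the following inequalities.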

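$$\begin{aligned}
&\mathbb{P}\big\{X_0^{(\alpha)} \in K_0, \ldots, X_n^{(\alpha)} \in K_n, Y_0^{(\alpha)} \in \Phi_0, \ldots, Y_n^{(\alpha)} \in \Phi_n\big\} \\
&\quad = \int_{\times_{i=0}^{n-1}(K_i \times \Phi_i)} \bigg( \int_{K_n} p_n\big(dx; X_0^{(\alpha)} = x_0, Y_0^{(\alpha)} = y_0, \ldots, X_{n-1}^{(\alpha)} = x_{n-1}, Y_{n-1}^{(\alpha)} = y_{n-1}\big) \\
&\qquad\quad \times \mathbb{P}\big\{Y_n^{(\alpha)} \in \Phi_n \big| Y_0^{(\alpha)} = y_0, X_0^{(\alpha)} = x_0, \ldots, Y_{n-1}^{(\alpha)} = y_{n-1}, X_n^{(\alpha)} = x\big\} \bigg) \\
&\qquad\quad \times \mathbb{P}\big\{X_0^{(\alpha)} \in dx_0, \ldots, X_{n-1}^{(\alpha)} \in dx_{n-1}, Y_0^{(\alpha)} \in dy_0, \ldots, Y_{n-1}^{(\alpha)} \in dy_{n-1}\big\} \\
&\quad \overset{(a)}{\geq} \Big(1 - \frac{\epsilon_1}{2^{n+1}}\Big)\Big(1 - \frac{\epsilon_2}{2^{n+1}}\Big) \mathbb{P}\big\{X_0^{(\alpha)} \in K_0, \ldots, X_{n-1}^{(\alpha)} \in K_{n-1}, Y_0^{(\alpha)} \in \Phi_0, \ldots, Y_{n-1}^{(\alpha)} \in \Phi_{n-1}\big\} \\
&\quad \geq \Big(1 - \frac{\epsilon_1}{2^{n+1}} - \frac{\epsilon_2}{2^{n+1}}\Big) \mathbb{P}\big\{X_0^{(\alpha)} \in K_0, \ldots, X_{n-1}^{(\alpha)} \in K_{n-1}, Y_0^{(\alpha)} \in \Phi_0, \ldots, Y_{n-1}^{(\alpha)} \in \Phi_{n-1}\big\}.
\end{aligned}$$

The inequality (a) follows from (VII.3). By repeating the procedure (or by perfect induction) we obtain the following inequalities.

$$\begin{aligned}
&\mathbb{P}\big\{X_0^{(\alpha)} \in K_0, \ldots, X_n^{(\alpha)} \in K_n, Y_0^{(\alpha)} \in \Phi_0, \ldots, Y_n^{(\alpha)} \in \Phi_n\big\} \\
&\quad \geq 1 - \frac{\epsilon_1}{2^{n+1}} - \frac{\epsilon_2}{2^{n+1}} - \frac{\epsilon_1}{2^{n}} - \frac{\epsilon_2}{2^{n}} - \frac{\epsilon_1}{2^{n-1}} - \frac{\epsilon_2}{2^{n-1}} - \cdots - \frac{\epsilon_1}{2} - \frac{\epsilon_2}{2} \\
&\quad = 1 - (\epsilon_1 + \epsilon_2) \sum_{i=1}^{n+1} \frac{1}{2^i} > 1 - (\epsilon_1 + \epsilon_2) \geq 1 - \epsilon, \quad \text{for } \alpha = 1, 2, \ldots, \; n \in \mathbb{N}.
\end{aligned}$$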
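
The arithmetic behind the tightness bound is a geometric budget: the per-step losses $\epsilon_1/2^i$ and $\epsilon_2/2^i$, $i = 1, \ldots, n+1$, never add up to more than $\epsilon_1 + \epsilon_2$, whatever the horizon $n$. The following minimal sketch checks this numerically; the function name `tightness_bounds` and the chosen values of $\epsilon_1$, $\epsilon_2$ are assumptions for illustration.

```python
def tightness_bounds(eps1, eps2, n):
    """Compare the product of per-step lower bounds with the geometric budget.

    Returns (product, linear, floor) where
      product = prod_{i=1}^{n+1} (1 - eps1/2^i)(1 - eps2/2^i),
      linear  = 1 - (eps1 + eps2) * sum_{i=1}^{n+1} 2^{-i},
      floor   = 1 - (eps1 + eps2).
    """
    product = 1.0
    budget = 0.0
    for i in range(1, n + 2):          # i = 1, ..., n + 1
        product *= (1.0 - eps1 / 2**i) * (1.0 - eps2 / 2**i)
        budget += (eps1 + eps2) / 2**i
    return product, 1.0 - budget, 1.0 - (eps1 + eps2)

for n in (0, 1, 5, 20):
    print(n, tightness_bounds(0.05, 0.03, n))
```

The ordering product >= linear >= floor holds for every horizon, which is why the resulting bound is uniform in $n$ and $\alpha$.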

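Hence, we have shown that the family of measures corresponding to the basic joint process $\{(X_i^{(\alpha)}, Y_i^{(\alpha)}) : i \in \mathbb{N}\}$ is tight (uniformly), hence it converges weakly to the measures corresponding to the joint process $\{(X_i^{(0)}, Y_i^{(0)}) : i \in \mathbb{N}\}$. Thus, $\{(X_i^{(\alpha)}, Y_i^{(\alpha)}) : i \in \mathbb{N}\}$ converges in distribution to $\{(X_i^{(0)}, Y_i^{(0)}) : i \in \mathbb{N}\}$, i.e., the basic joint process converges in distribution.
Next, we show that the limiting joint process $\{(X_i^{(0)}, Y_i^{(0)}) : i \in \mathbb{N}\}$ is a basic joint process with the same backward channel $\overleftarrow{P}_{0,n}(\cdot|\cdot) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$. For any $n \in \mathbb{N}$, consider bounded and continuous real-valued functions $g_n(\cdot) \in BC(\mathcal{X}_n)$ and $\Psi_{0,n-1}(\cdot, \cdot) \in BC(\mathcal{X}_{0,n-1} \times \mathcal{Y}_{0,n-1})$. By the weak convergence of the joint measures corresponding to $\{(X_i^{(\alpha)}, Y_i^{(\alpha)}) : i \in \mathbb{N}\}$ to the joint measures corresponding to $\{(X_i^{(0)}, Y_i^{(0)}) : i \in \mathbb{N}\}$, denoted by $(\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^{\alpha}_{0,n})(dx^n, dy^n) \xrightarrow{w} (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}^{0}_{0,n})(dx^n, dy^n)$, the continuity of $\Psi_{0,n-1}(\cdot, \cdot)$ and the continuity of the function mapping $(x^{n-1}, y^{n-1}) \in \mathcal{X}_{0,n-1} \times \mathcal{Y}_{0,n-1} \mapsto \int_{\mathcal{X}_n} g_n(x) p_n(dx; x^{n-1}, y^{n-1}) \in \mathbb{R}$, given $\epsilon > 0$ there exists $N \in \mathbb{N}$ such that for all $\alpha \geq N$

$$\bigg| \int \Big( \int_{\mathcal{X}_n} g_n(x) p_n(dx; x^{n-1}, y^{n-1}) \Big) \Psi_{0,n-1}(x^{n-1}, y^{n-1}) P^{\alpha}_{0,n-1}(dx^{n-1}, dy^{n-1}) - \int \Big( \int_{\mathcal{X}_n} g_n(x) p_n(dx; x^{n-1}, y^{n-1}) \Big) \Psi_{0,n-1}(x^{n-1}, y^{n-1}) P^{0}_{0,n-1}(dx^{n-1}, dy^{n-1}) \bigg| < \epsilon.$$

Since $\epsilon > 0$ is arbitrary, then

$$\lim_{\alpha \to \infty} \mathbb{E}\Big\{ g_n(X_n^{(\alpha)}) \Psi(X_0^{(\alpha)}, \ldots, X_{n-1}^{(\alpha)}, Y_0^{(\alpha)}, \ldots, Y_{n-1}^{(\alpha)}) \Big\} = \mathbb{E}\Big\{ g_n(X_n^{(0)}) \Psi(X_0^{(0)}, \ldots, X_{n-1}^{(0)}, Y_0^{(0)}, \ldots, Y_{n-1}^{(0)}) \Big\}. \tag{VII.4}$$

Moreover, for all $\alpha = 1, 2, \ldots$, then

$$\mathbb{E}\Big\{ g_n(X_n^{(\alpha)}) \Psi(X_0^{(\alpha)}, \ldots, Y_{n-1}^{(\alpha)}) \Big\} = \mathbb{E}\Big\{ \int_{\mathcal{X}_n} g_n(x) p_n(dx; X_0^{(\alpha)}, \ldots, X_{n-1}^{(\alpha)}, Y_0^{(\alpha)}, \ldots, Y_{n-1}^{(\alpha)}) \, \Psi(X_0^{(\alpha)}, \ldots, X_{n-1}^{(\alpha)}, Y_0^{(\alpha)}, \ldots, Y_{n-1}^{(\alpha)}) \Big\}.$$

Hence, (VII.4) is equivalent to

$$\begin{aligned}
&\lim_{\alpha \to \infty} \mathbb{E}\Big\{ \int_{\mathcal{X}_n} g_n(x) p_n(dx; X_0^{(\alpha)}, \ldots, X_{n-1}^{(\alpha)}, Y_0^{(\alpha)}, \ldots, Y_{n-1}^{(\alpha)}) \, \Psi(X_0^{(\alpha)}, \ldots, X_{n-1}^{(\alpha)}, Y_0^{(\alpha)}, \ldots, Y_{n-1}^{(\alpha)}) \Big\} \\
&\quad = \mathbb{E}\Big\{ \int_{\mathcal{X}_n} g_n(x) p_n(dx; X_0^{(0)}, \ldots, X_{n-1}^{(0)}, Y_0^{(0)}, \ldots, Y_{n-1}^{(0)}) \, \Psi(X_0^{(0)}, \ldots, X_{n-1}^{(0)}, Y_0^{(0)}, \ldots, Y_{n-1}^{(0)}) \Big\}.
\end{aligned}$$

From the previous equality, which also holds for $\alpha = 0$, it follows that the next formula holds almost surely.

$$\mathbb{E}\Big\{ g_n(X_n^{(0)}) \Big| X_0^{(0)}, \ldots, X_{n-1}^{(0)}, Y_0^{(0)}, \ldots, Y_{n-1}^{(0)} \Big\} = \int_{\mathcal{X}_n} g_n(x) p_n(dx; X_0^{(0)}, \ldots, X_{n-1}^{(0)}, Y_0^{(0)}, \ldots, Y_{n-1}^{(0)}) \quad \text{a.s.}$$

Letting $g_n(\cdot)$ be the indicator function $I_E$, $E \in \mathcal{B}(\mathcal{X}_n)$, then

$$\mathbb{P}\Big\{ X_n^{(0)} \in E \Big| X_0^{(0)}, \ldots, X_{n-1}^{(0)}, Y_0^{(0)}, \ldots, Y_{n-1}^{(0)} \Big\} = p_n(E; X_0^{(0)}, \ldots, X_{n-1}^{(0)}, Y_0^{(0)}, \ldots, Y_{n-1}^{(0)}) \quad \text{a.s.}$$

This shows that the limiting joint process $\{(X_i^{(0)}, Y_i^{(0)}) : i \in \mathbb{N}\}$ is a basic process corresponding to the backward channel $\overleftarrow{P}_{0,n}(\cdot|\cdot) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$. This completes the derivation of A1).
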
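A2) Here, we show that the marginal distributions generated by the joint sequence of RV's $\{(X_i^{(\alpha)}, Y_i^{(\alpha)}) : i \in \mathbb{N}\}$ are weakly compact. In A1) it is shown that for any $n \in \mathbb{N}$, given $\epsilon_1 > 0$, there are compact sets $K_0 \subset \mathcal{X}_0, K_1 \subset \mathcal{X}_1, \ldots, K_n \subset \mathcal{X}_n$ such that for any $x_0 \in K_0, x_1 \in K_1, \ldots, x_{n-1} \in K_{n-1}$, $y^{n-1} \in \mathcal{Y}_{0,n-1}$,

$$p_n(K_n; x^{n-1}, y^{n-1}) \geq 1 - \frac{\epsilon_1}{2^{n+1}}. \tag{VII.5}$$

Utilizing (VII.5) then

$$\begin{aligned}
\mathbb{P}\big\{X_0^{(\alpha)} \in K_0, \ldots, X_n^{(\alpha)} \in K_n\big\}
&\geq \Big(1 - \frac{\epsilon_1}{2^{n+1}}\Big) \mathbb{P}\big\{X_0^{(\alpha)} \in K_0, \ldots, X_{n-1}^{(\alpha)} \in K_{n-1}\big\} \\
&\geq \Big(1 - \frac{\epsilon_1}{2^{n+1}} - \frac{\epsilon_1}{2^{n}}\Big) \mathbb{P}\big\{X_0^{(\alpha)} \in K_0, \ldots, X_{n-2}^{(\alpha)} \in K_{n-2}\big\} \\
&\geq \Big(1 - \frac{\epsilon_1}{2^{n+1}} - \frac{\epsilon_1}{2^{n}} - \frac{\epsilon_1}{2^{n-1}}\Big) \mathbb{P}\big\{X_0^{(\alpha)} \in K_0, \ldots, X_{n-3}^{(\alpha)} \in K_{n-3}\big\}.
\end{aligned}$$

By repeating the above procedure we obtain

$$\begin{aligned}
\mathbb{P}\big\{X_0^{(\alpha)} \in K_0, \ldots, X_n^{(\alpha)} \in K_n\big\}
&\geq 1 - \frac{\epsilon_1}{2^{n+1}} - \frac{\epsilon_1}{2^{n}} - \cdots - \frac{\epsilon_1}{2} \\
&= 1 - \epsilon_1 \sum_{j=1}^{n+1} \frac{1}{2^j} \\
&> 1 - \epsilon_1, \quad \text{for all } \alpha = 1, 2, \ldots, \; n \in \mathbb{N}.
\end{aligned}$$

This shows that the family of marginal distributions of the joint process $\{(X_i^{(\alpha)}, Y_i^{(\alpha)}) : i \in \mathbb{N}\}$ on $\mathcal{X}_{0,n}$ is uniformly tight and hence weakly compact. Consequently, the family of marginal distributions of the joint basic process $\{(X_i^{(\alpha)}, Y_i^{(\alpha)}) : i \in \mathbb{N}\}$ on $\mathcal{X}_{0,n}$ converges weakly to the marginal distribution of the joint process $\{(X_i^{(0)}, Y_i^{(0)}) : i \in \mathbb{N}\}$ on $\mathcal{X}_{0,n}$. Following the same methodology as before, it can be shown that the family of marginal distributions of the joint sequence $\{(X_i^{(\alpha)}, Y_i^{(\alpha)}) : i \in \mathbb{N}\}$ on $\mathcal{Y}_{0,n}$ is uniformly tight and hence weakly compact. Indeed, given $\epsilon_2 > 0$ there are compact sets $\Phi_0 \subset \mathcal{Y}_0, \Phi_1 \subset \mathcal{Y}_1, \ldots, \Phi_n \subset \mathcal{Y}_n$ such that

$$\begin{aligned}
\mathbb{P}\big\{Y_0^{(\alpha)} \in \Phi_0, \ldots, Y_n^{(\alpha)} \in \Phi_n\big\}
&\geq 1 - \frac{\epsilon_2}{2^{n+1}} - \frac{\epsilon_2}{2^{n}} - \frac{\epsilon_2}{2^{n-1}} - \cdots - \frac{\epsilon_2}{2} \\
&= 1 - \epsilon_2 \sum_{i=1}^{n+1} \frac{1}{2^i} \\
&> 1 - \epsilon_2, \quad \text{for all } \alpha = 1, 2, \ldots, \; n \in \mathbb{N}.
\end{aligned}$$

This shows that the family of marginal distributions of the joint process $\{(X_i^{(\alpha)}, Y_i^{(\alpha)}) : i \in \mathbb{N}\}$ on $\mathcal{Y}_{0,n}$ is uniformly tight and hence weakly compact. Consequently, the family of marginal distributions of the joint process $\{(X_i^{(\alpha)}, Y_i^{(\alpha)}) : i \in \mathbb{N}\}$ on $\mathcal{Y}_{0,n}$ converges weakly to the marginal distribution of the joint process $\{(X_i^{(0)}, Y_i^{(0)}) : i \in \mathbb{N}\}$ on $\mathcal{Y}_{0,n}$. It was shown earlier that the basic process $\{(X_i^{(\alpha)}, Y_i^{(\alpha)}) : i \in \mathbb{N}\}$ converges weakly to the joint process $\{(X_i^{(0)}, Y_i^{(0)}) : i \in \mathbb{N}\}$ and that this limiting joint process has the same backward channel $\overleftarrow{P}_{0,n}(\cdot|y^{n-1})$ and a forward channel $\overrightarrow{Q}^{0}_{0,n}(\cdot|x^n)$. Hence, the marginal distributions of the basic joint process $\{(X_i^{(\alpha)}, Y_i^{(\alpha)}) : i \in \mathbb{N}\}$ converge to the marginal distributions of the basic joint process $\{(X_i^{(0)}, Y_i^{(0)}) : i \in \mathbb{N}\}$ corresponding to the backward channel $\overleftarrow{P}_{0,n}(\cdot|y^{n-1}) \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$ and a forward channel $\overrightarrow{Q}^{0}_{0,n}(\cdot|x^n) \in \mathcal{Q}^{C2}(\mathcal{Y}_{0,n}; \mathcal{X}_{0,n})$. This completes the derivation of A2).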



D. Proof of Theorem III.9 

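To show continuity of $I_{X^n \to Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n})$ with respect to weak convergence, we need to show that for every sequence $\{\overleftarrow{P}^{\alpha}_{0,n} : \alpha = 1, 2, \ldots\}$ such that $\overleftarrow{P}^{\alpha}_{0,n} \xrightarrow{w} \overleftarrow{P}^{0}_{0,n}$, we have

$$\lim_{\alpha \to \infty} I_{X^n \to Y^n}(\overleftarrow{P}^{\alpha}_{0,n}, \overrightarrow{Q}_{0,n}) = I_{X^n \to Y^n}(\overleftarrow{P}^{0}_{0,n}, \overrightarrow{Q}_{0,n}).$$

The derivation is based on the procedure utilized in [36] to show continuity for single letter mutual information. First, decompose the directed information into two terms as follows.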

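$$\begin{aligned}
I_{X^n \to Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n})
&= \int_{\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}} \log\Big(\frac{\overrightarrow{Q}_{0,n}(dy^n|x^n)}{\bar{\nu}_{0,n}(dy^n)}\Big) (\overleftarrow{P}_{0,n} \otimes \overrightarrow{Q}_{0,n})(dx^n, dy^n) - \int_{\mathcal{Y}_{0,n}} \log\Big(\frac{\nu_{0,n}(dy^n)}{\bar{\nu}_{0,n}(dy^n)}\Big) \nu_{0,n}(dy^n) \\
&= \int_{\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}} \Big(\xi_{\bar{\nu}_{0,n}}(x^n, y^n) \log \xi_{\bar{\nu}_{0,n}}(x^n, y^n)\Big) \overleftarrow{P}_{0,n}(dx^n|y^{n-1}) \otimes \bar{\nu}_{0,n}(dy^n) \\
&\quad - \int_{\mathcal{Y}_{0,n}} \xi_{\bar{\nu}_{0,n}, \overleftarrow{P}_{0,n}}(y^n) \log \xi_{\bar{\nu}_{0,n}, \overleftarrow{P}_{0,n}}(y^n) \, \bar{\nu}_{0,n}(dy^n)
\end{aligned} \tag{VII.6}$$

where $\xi_{\bar{\nu}_{0,n}, \overleftarrow{P}_{0,n}}(y^n) = \frac{\nu_{0,n}(dy^n)}{\bar{\nu}_{0,n}(dy^n)}(y^n)$ emphasizes the fact that this RND depends on $\overleftarrow{P}_{0,n}$. For now, assume that both terms on the right hand side of the above formula are finite; the validity of this assumption will be established at the end. Thus, we only need to show that both terms are bounded and continuous in the weak sense over $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$.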
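
In the single-letter, finite-alphabet case the analogous decomposition of mutual information relative to a reference measure can be verified directly. The following minimal sketch assumes an input pmf `p`, a channel `q`, and a reference pmf `nu_bar` dominating every row of `q`; these names and the numerical values are illustrative assumptions, and the single-letter setting is a simplification of the directed-information setting of the proof.

```python
import math

def slogs(s):
    """s log s with the convention 0 log 0 = 0."""
    return s * math.log(s) if s > 0.0 else 0.0

def decomposition(p, q, nu_bar):
    """Return (first, second, mi) for input pmf p[x], channel q[x][y], reference nu_bar[y].

    first  = sum_{x,y} p(x) nu_bar(y) xi(x,y) log xi(x,y),  xi = q(y|x) / nu_bar(y)
    second = sum_y nu_bar(y) xi_p(y) log xi_p(y),           xi_p = nu(y) / nu_bar(y)
    mi     = I(X;Y) computed directly, so that mi = first - second.
    """
    xs, ys = range(len(p)), range(len(nu_bar))
    xi = [[q[x][y] / nu_bar[y] for y in ys] for x in xs]
    xi_p = [sum(p[x] * xi[x][y] for x in xs) for y in ys]
    first = sum(p[x] * nu_bar[y] * slogs(xi[x][y]) for x in xs for y in ys)
    second = sum(nu_bar[y] * slogs(xi_p[y]) for y in ys)
    nu = [nu_bar[y] * xi_p[y] for y in ys]       # output marginal
    mi = sum(p[x] * q[x][y] * math.log(q[x][y] / nu[y])
             for x in xs for y in ys if p[x] * q[x][y] > 0.0)
    return first, second, mi

p = [0.3, 0.7]
q = [[0.9, 0.1], [0.2, 0.8]]
print(decomposition(p, q, [0.5, 0.5]))
print(decomposition(p, q, [0.25, 0.75]))   # a different reference, same mutual information
```

Only the difference of the two terms is independent of the reference measure; each term on its own changes with `nu_bar`.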

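Continuity of the first term. Since $\overleftarrow{P}^{\alpha}_{0,n} \xrightarrow{w} \overleftarrow{P}^{0}_{0,n}$, by [ , Theorem A.5.8, p. 320], utilizing Lebesgue's dominated convergence theorem, we have $\bar{\nu}_{0,n} \otimes \overleftarrow{P}^{\alpha}_{0,n} \xrightarrow{w} \bar{\nu}_{0,n} \otimes \overleftarrow{P}^{0}_{0,n}$. Since $\xi_{\bar{\nu}_{0,n}}(x^n, y^n)$ is continuous, then so is $\xi_{\bar{\nu}_{0,n}}(x^n, y^n) \log \xi_{\bar{\nu}_{0,n}}(x^n, y^n)$. By hypothesis, $\xi_{\bar{\nu}_{0,n}}(x^n, y^n) \log \xi_{\bar{\nu}_{0,n}}(x^n, y^n)$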

February 19, 2013 DRAFT 






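is uniformly integrable over $\{\bar{\nu}_{0,n} \otimes \overleftarrow{P}_{0,n} : \overleftarrow{P}_{0,n} \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})\}$. Therefore, using Theorem VI.12, Appendix VI-A, we conclude that

$$\begin{aligned}
&\lim_{\alpha \to \infty} \int_{\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}} \xi_{\bar{\nu}_{0,n}}(x^n, y^n) \log \xi_{\bar{\nu}_{0,n}}(x^n, y^n) \, \bar{\nu}_{0,n}(dy^n) \otimes \overleftarrow{P}^{\alpha}_{0,n}(dx^n|y^{n-1}) \\
&\quad = \int_{\mathcal{X}_{0,n} \times \mathcal{Y}_{0,n}} \xi_{\bar{\nu}_{0,n}}(x^n, y^n) \log \xi_{\bar{\nu}_{0,n}}(x^n, y^n) \, \bar{\nu}_{0,n}(dy^n) \otimes \overleftarrow{P}^{0}_{0,n}(dx^n|y^{n-1}).
\end{aligned} \tag{VII.7}$$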

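This proves the continuity of the first term. The finiteness of the first term is obtained from uniform integrability as follows. For a given $\epsilon > 0$ and sufficiently large $c > 0$,

$$\begin{aligned}
&\sup \int \big|\xi_{\bar{\nu}_{0,n}}(x^n, y^n) \log \xi_{\bar{\nu}_{0,n}}(x^n, y^n)\big| \, \bar{\nu}_{0,n}(dy^n) \otimes \overleftarrow{P}_{0,n}(dx^n|y^{n-1}) \\
&\quad \leq \sup \int \big|\xi_{\bar{\nu}_{0,n}}(x^n, y^n) \log \xi_{\bar{\nu}_{0,n}}(x^n, y^n)\big| \, I_{\{|\xi_{\bar{\nu}_{0,n}}(x^n, y^n) \log \xi_{\bar{\nu}_{0,n}}(x^n, y^n)| > c\}} \, \bar{\nu}_{0,n}(dy^n) \otimes \overleftarrow{P}_{0,n}(dx^n|y^{n-1}) \\
&\qquad + \sup \int \big|\xi_{\bar{\nu}_{0,n}}(x^n, y^n) \log \xi_{\bar{\nu}_{0,n}}(x^n, y^n)\big| \, I_{\{|\xi_{\bar{\nu}_{0,n}}(x^n, y^n) \log \xi_{\bar{\nu}_{0,n}}(x^n, y^n)| \leq c\}} \, \bar{\nu}_{0,n}(dy^n) \otimes \overleftarrow{P}_{0,n}(dx^n|y^{n-1}) \\
&\quad \leq \epsilon + c,
\end{aligned}$$

where the suprema are taken over $\overleftarrow{P}_{0,n} \in \mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$.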



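Continuity of the second term. For a fixed $y^n \in \mathcal{Y}_{0,n}$, since $\xi_{\bar{\nu}_{0,n}}(\cdot, y^n)$ is uniformly integrable over $\mathcal{Q}^{C1}(\mathcal{X}_{0,n}; \mathcal{Y}_{0,n-1})$, by Theorem VI.10, Appendix VI-A, we deduce that $\overleftarrow{P}^{\alpha}_{0,n} \xrightarrow{w} \overleftarrow{P}^{0}_{0,n}$ implies pointwise convergence of $\xi_{\bar{\nu}_{0,n}, \overleftarrow{P}^{\alpha}_{0,n}}(y^n) \to \xi_{\bar{\nu}_{0,n}, \overleftarrow{P}^{0}_{0,n}}(y^n)$. By continuity of the logarithm, we obtain the pointwise convergence of $\xi_{\bar{\nu}_{0,n}, \overleftarrow{P}^{\alpha}_{0,n}}(y^n) \log \xi_{\bar{\nu}_{0,n}, \overleftarrow{P}^{\alpha}_{0,n}}(y^n) \to \xi_{\bar{\nu}_{0,n}, \overleftarrow{P}^{0}_{0,n}}(y^n) \log \xi_{\bar{\nu}_{0,n}, \overleftarrow{P}^{0}_{0,n}}(y^n)$.

It only remains to show convergence under the integral with respect to $\bar{\nu}_{0,n}$. By (III.34), then $\forall \alpha$

$$\begin{aligned}
\xi_{\bar{\nu}_{0,n}, \overleftarrow{P}^{\alpha}_{0,n}}(y^n) \log \xi_{\bar{\nu}_{0,n}, \overleftarrow{P}^{\alpha}_{0,n}}(y^n)
&= \int_{\mathcal{X}_{0,n}} \xi_{\bar{\nu}_{0,n}, \overleftarrow{P}^{\alpha}_{0,n}}(y^n) \log \xi_{\bar{\nu}_{0,n}, \overleftarrow{P}^{\alpha}_{0,n}}(y^n) \, \overleftarrow{P}^{\alpha}_{0,n}(dx^n|y^{n-1}) \\
&\leq \int_{\mathcal{X}_{0,n}} \Big(\xi_{\bar{\nu}_{0,n}}(x^n, y^n) \log \xi_{\bar{\nu}_{0,n}}(x^n, y^n)\Big) \overleftarrow{P}^{\alpha}_{0,n}(dx^n|y^{n-1})
\end{aligned} \tag{VII.8}$$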




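where (VII.8) follows from (VII.6) and the nonnegativity of $I_{X^n \to Y^n}(\overleftarrow{P}^{\alpha}_{0,n}, \overrightarrow{Q}_{0,n})$. By (VII.7), the integration of the right hand side over $\bar{\nu}_{0,n}$ converges. Thus, by the generalized Lebesgue's dominated convergence theorem [ , p. 59], we conclude that

$$\lim_{\alpha \to \infty} \int_{\mathcal{Y}_{0,n}} \xi_{\bar{\nu}_{0,n}, \overleftarrow{P}^{\alpha}_{0,n}}(y^n) \log \xi_{\bar{\nu}_{0,n}, \overleftarrow{P}^{\alpha}_{0,n}}(y^n) \, \bar{\nu}_{0,n}(dy^n) = \int_{\mathcal{Y}_{0,n}} \xi_{\bar{\nu}_{0,n}, \overleftarrow{P}^{0}_{0,n}}(y^n) \log \xi_{\bar{\nu}_{0,n}, \overleftarrow{P}^{0}_{0,n}}(y^n) \, \bar{\nu}_{0,n}(dy^n).$$

This implies the continuity of the second term. Furthermore, its finiteness follows as before. Since both terms are finite and continuous we deduce continuity of the directed information $I_{X^n \to Y^n}(\overleftarrow{P}_{0,n}, \overrightarrow{Q}_{0,n})$. This completes the derivation.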